This notebook demonstrates how to extract structured information from a product review using Python, regular expressions, and a BERT transformer model for sentiment analysis. 

It defines a schema for the structured output using Python's TypedDict with type annotations, then processes a sample review to extract key themes, a summary, sentiment (positive or negative), pros, cons, and the reviewer's name. 

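The notebook uses regular expressions to parse the review text, matches a fixed keyword list to identify themes, and runs the text through Hugging Face's bert-base-uncased checkpoint to assign a sentiment label. Because that checkpoint ships without a trained classification head, the label produced here illustrates the mechanics rather than a reliable prediction; a checkpoint fine-tuned for sentiment would be needed for meaningful results.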

The final output is a structured dictionary containing all extracted and inferred information, making it suitable for downstream tasks such as analytics or database storage.

In [23]:
# Import necessary libraries: HuggingFace Transformers for BERT, PyTorch, regex, typing for type hints, and dotenv for environment variables.
from transformers import BertForSequenceClassification, BertTokenizer
import torch
import re
from typing import TypedDict, Annotated, Optional, Literal
from dotenv import load_dotenv
load_dotenv()

True

In [24]:
# Define a TypedDict schema for structured review output, with annotations for each field.
class Review(TypedDict):
    key_themes: Annotated[list[str], "Key themes in the review"]
    summary: Annotated[str, "Brief summary"]
    sentiment: Annotated[Literal["pos", "neg"], "Sentiment: pos or neg"]
    pros: Annotated[Optional[list[str]], "Pros as a list"]
    cons: Annotated[Optional[list[str]], "Cons as a list"]
    name: Annotated[Optional[str], "Reviewer name"]
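
TypedDict annotations are not enforced at runtime, but the descriptions attached with Annotated can be read back, for example to build a prompt or documentation. The following is a minimal sketch; `ReviewSketch` and `field_descriptions` are hypothetical names introduced for illustration, and the schema is a shortened copy of the one above.

```python
from typing import Annotated, Literal, Optional, TypedDict, get_type_hints


# Shortened, hypothetical copy of the Review schema, used only for this sketch.
class ReviewSketch(TypedDict):
    summary: Annotated[str, "Brief summary"]
    sentiment: Annotated[Literal["pos", "neg"], "Sentiment: pos or neg"]
    name: Annotated[Optional[str], "Reviewer name"]


def field_descriptions(schema) -> dict[str, str]:
    """Map each field name to the first metadata string in its Annotated hint."""
    hints = get_type_hints(schema, include_extras=True)
    return {
        field: hint.__metadata__[0]
        for field, hint in hints.items()
        if hasattr(hint, "__metadata__")
    }


descriptions = field_descriptions(ReviewSketch)
print(descriptions)
```

Passing `include_extras=True` matters here: without it, `get_type_hints` strips the Annotated wrapper and the descriptions are lost.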

In [25]:
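# Load the BERT tokenizer and model for sequence classification from Hugging Face.
# Note: bert-base-uncased is a base checkpoint, so the classification head is randomly
# initialized (see the warning below) and its predictions are not meaningful until fine-tuned.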
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
# Example review text to be processed and structured.
review_text = """I recently upgraded to the Samsung Galaxy S24 Ultra, and I must say, it’s an absolute powerhouse! The Snapdragon 8 Gen 3 processor makes everything lightning fast—whether I’m gaming, multitasking, or editing photos. The 5000mAh battery easily lasts a full day even with heavy use, and the 45W fast charging is a lifesaver.

The S-Pen integration is a great touch for note-taking and quick sketches, though I don't use it often. What really blew me away is the 200MP camera—the night mode is stunning, capturing crisp, vibrant images even in low light. Zooming up to 100x actually works well for distant objects, but anything beyond 30x loses quality.

However, the weight and size make it a bit uncomfortable for one-handed use. Also, Samsung’s One UI still comes with bloatware—why do I need five different Samsung apps for things Google already provides? The $1,300 price tag is also a hard pill to swallow.

Pros:
Insanely powerful processor (great for gaming and productivity)
Stunning 200MP camera with incredible zoom capabilities
Long battery life with fast charging
S-Pen support is unique and useful
                                 
Review by saifullah
"""

In [27]:
# Helper function to extract a list of items from text using a regex pattern.
def extract_list(pattern, text):
    match = re.search(pattern, text, re.DOTALL)
    return [line.strip() for line in match.group(1).split('\n') if line.strip()] if match else []
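
To make the helper's behavior concrete, here is a small usage sketch on a hypothetical sample string (not the review used in this notebook). It shows both a successful match and the empty-list fallback when the section is absent.

```python
import re


def extract_list(pattern, text):
    """Return the non-empty, stripped lines captured by group 1 of the pattern."""
    match = re.search(pattern, text, re.DOTALL)
    if not match:
        return []
    return [line.strip() for line in match.group(1).split("\n") if line.strip()]


# Hypothetical sample text for illustration only.
sample = "Great phone.\n\nPros:\nFast\nGood camera\n\nReview by alex\n"

items = extract_list(r"Pros:\s*(.*?)(?:\n\n|$)", sample)
missing = extract_list(r"Cons:\s*(.*?)(?:\n\n|$)", sample)

print(items)    # lines under "Pros:" up to the next blank line
print(missing)  # no "Cons:" section, so the helper falls back to []
```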


In [28]:
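# Extract pros from the review text using the helper function.
# The section ends at the next blank line; \n\s*\n also accepts a separator line that contains only spaces.
pros = extract_list(r"Pros:\s*(.*?)(?:\n\s*\n|$)", review_text)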


In [29]:
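# Extract cons from the paragraph that begins with "However,", splitting it on periods.
# Sentences that end in "?" or "!" are not split, so they stay attached to the sentence that follows.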
cons = []
cons_match = re.search(r"However,(.*?)(?:\n\n|$)", review_text, re.DOTALL)
if cons_match:
    cons = [sent.strip() for sent in cons_match.group(1).split('.') if sent.strip()]
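
Splitting on "." alone keeps a question and the sentence after it in one item. One possible alternative, sketched here under the assumption that sentences end in ".", "!" or "?" followed by whitespace, is to split on that boundary with a lookbehind. `split_sentences` is a hypothetical helper, and it will still mis-split abbreviations such as "e.g. ".

```python
import re


def split_sentences(paragraph: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace; keep the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part.strip() for part in parts if part.strip()]


# Hypothetical sample paragraph for illustration only.
sample = "the phone is heavy. Also, why so many apps? The price is high."
sentences = split_sentences(sample)
print(sentences)
```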


In [30]:
# Extract reviewer name if present.
name_match = re.search(r"Review by ([\w\s]+)", review_text)
name = name_match.group(1).strip() if name_match else None
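
Because `\s` also matches newlines, the character class `[\w\s]+` can run past the end of the signature line when more text follows it. The sketch below compares the pattern above with a hypothetical single-line variant on an invented sample; in this notebook the signature is the last line, so both would return the same name.

```python
import re

# Hypothetical sample with a line following the signature.
sample = "Review by saifullah\nPosted on Monday\n"

# Pattern used above: \s lets the match continue across the line break.
greedy = re.search(r"Review by ([\w\s]+)", sample).group(1).strip()

# Assumed alternative: allow only word characters, spaces and tabs.
single_line = re.search(r"Review by[ \t]+([\w \t]+)", sample).group(1).strip()

print(repr(greedy))
print(repr(single_line))
```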


In [31]:
# Identify key themes by checking for keywords in the review text.
themes = [kw for kw in ["camera", "battery", "processor", "S-Pen", "price", "bloatware", "One UI"]
          if kw.lower() in review_text.lower()]
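
Substring checks also fire inside longer words (for example, "price" inside "priceless"). If that matters, a word-boundary search is one option. This is a minimal sketch; `find_themes`, the shortened keyword list, and the sample sentence are hypothetical.

```python
import re

KEYWORDS = ["battery", "price", "S-Pen"]  # shortened list for illustration


def find_themes(text: str, keywords: list[str]) -> list[str]:
    """Keep keywords that occur as whole words, ignoring case."""
    return [
        kw for kw in keywords
        if re.search(rf"\b{re.escape(kw)}\b", text, re.IGNORECASE)
    ]


sample = "The battery is priceless and the s-pen works well."

substring_hits = [kw for kw in KEYWORDS if kw.lower() in sample.lower()]
boundary_hits = find_themes(sample, KEYWORDS)

print(substring_hits)
print(boundary_hits)
```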


In [32]:
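# Use the first paragraph of the review as the summary.
# split('\n')[0:2] returns that paragraph plus the blank line after it, which is why the result ends with a trailing space.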
summary = " ".join(review_text.strip().split('\n')[0:2])


In [33]:
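# Tokenize the review text and run it through the BERT model to get a sentiment label.
# The index-to-label order ["neg", "pos"] is an assumption; with the untrained head loaded above,
# the resulting label is effectively arbitrary.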
inputs = tokenizer(review_text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
    sentiment = ["neg", "pos"][logits.argmax().item()]
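
Taking the argmax hides how close the two logits are. A softmax turns them into probabilities, which makes a near coin-flip easy to spot. The sketch below uses plain Python and invented logit values; in the notebook the input could be obtained with `logits[0].tolist()`. `label_with_confidence` and the label order are assumptions.

```python
import math

LABELS = ["neg", "pos"]  # assumed index-to-label order, as in the cell above


def label_with_confidence(logits: list[float]) -> tuple[str, float]:
    """Apply a numerically stable softmax and return the top label with its probability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Invented logits that are almost equal, as an untrained head might produce.
label, confidence = label_with_confidence([0.10, 0.12])
print(label, round(confidence, 3))
```

A confidence close to 0.5 on a two-class problem is a signal that the label should not be trusted.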


In [34]:
# Assemble the structured review dictionary.
structured_review: Review = {
    "key_themes": themes,
    "summary": summary,
    "sentiment": sentiment,
    "pros": pros,
    "cons": cons,
    "name": name
}


In [35]:
print("Structured Review Output:")
print(structured_review)

Structured Review Output:
{'key_themes': ['camera', 'battery', 'processor', 'S-Pen', 'price', 'bloatware', 'One UI'], 'summary': 'I recently upgraded to the Samsung Galaxy S24 Ultra, and I must say, it’s an absolute powerhouse! The Snapdragon 8 Gen 3 processor makes everything lightning fast—whether I’m gaming, multitasking, or editing photos. The 5000mAh battery easily lasts a full day even with heavy use, and the 45W fast charging is a lifesaver. ', 'sentiment': 'neg', 'pros': ['Insanely powerful processor (great for gaming and productivity)', 'Stunning 200MP camera with incredible zoom capabilities', 'Long battery life with fast charging', 'S-Pen support is unique and useful'], 'cons': ['the weight and size make it a bit uncomfortable for one-handed use', 'Also, Samsung’s One UI still comes with bloatware—why do I need five different Samsung apps for things Google already provides? The $1,300 price tag is also a hard pill to swallow'], 'name': 'saifullah'}
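
For downstream storage, a dictionary of this shape can be serialized to JSON with the standard library. This is a minimal sketch using a hypothetical, shortened record rather than the full output above; `ensure_ascii=False` keeps characters such as the curly apostrophe readable instead of escaping them.

```python
import json

# Hypothetical, shortened record with the same keys as the structured review.
record = {
    "key_themes": ["camera", "battery"],
    "summary": "It’s an absolute powerhouse!",
    "sentiment": "pos",
    "pros": ["Long battery life"],
    "cons": [],
    "name": None,
}

payload = json.dumps(record, ensure_ascii=False, indent=2)
restored = json.loads(payload)

print(payload)
```

Python's `None` becomes JSON `null` and is restored as `None` on load, so optional fields survive the round trip.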
