# Customer Review Classification with Google Gemini

This notebook demonstrates an end-to-end workflow for classifying Disneyland customer reviews with a Google Gemini model through LangChain.

---
## Objectives
1. Download and prepare the Disneyland reviews dataset.
2. Convert the data to an LLM-compatible format.
3. Define a structured output schema using Pydantic.
4. Run LLM inference incrementally on a batch of data.
5. Combine the results and prepare for advanced analysis.

---
## Table of Contents
1. [Environment Setup](#setup-environment)
2. [Data Collection](#data-collection)
3. [Data Preparation](#data-preparation)
4. [LLM Initialization](#llm-initialization)
5. [Output Schema Definition](#output-structure-definition)
6. [Input Preparation](#input-preparation)
7. [LLM Execution & Consolidation](#execution-of-llm-and-data-consolidation)

## Environment Setup <a id="setup-environment"></a>
This section installs the required libraries (LangChain and other dependencies) and performs the initial configuration.

In [None]:
%%bash
pip install -qU "langchain[google-genai]" pydantic


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-generativeai 0.8.5 requires google-ai-generativelanguage==0.6.15, but you have google-ai-generativelanguage 0.6.18 which is incompatible.


In [None]:
from google.colab import drive
import warnings

### Connecting Kaggle API <a id="data-collection"></a>
This step connects the notebook to a Kaggle account so that datasets can be downloaded directly, without manual uploads. Make sure the `kaggle.json` API token has been copied to Google Drive before running the cells below.

In [None]:
drive.mount('/content/drive')
warnings.filterwarnings('ignore')

Mounted at /content/drive


In [None]:
%%bash
mkdir -p ~/.kaggle
cp /content/drive/MyDrive/colab_notebooks/Project/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
kaggle datasets download arushchillar/disneyland-reviews
unzip /content/disneyland-reviews.zip

Dataset URL: https://www.kaggle.com/datasets/arushchillar/disneyland-reviews
License(s): CC0-1.0
Downloading disneyland-reviews.zip to /content

Archive:  /content/disneyland-reviews.zip
  inflating: DisneylandReviews.csv   


100%|██████████| 11.1M/11.1M [00:00<00:00, 208MB/s]


### Loading Dataset to DataFrame <a id="data-preparation"></a>
The dataset is downloaded as a CSV file. At this stage we load it into a **pandas DataFrame** to make it easier to process. Set the encoding to `ISO-8859-1`: checking the file with `chardet` reported ISO-8859-1 as its encoding.

In [None]:
import pandas as pd

df = pd.read_csv('/content/DisneylandReviews.csv', encoding='ISO-8859-1')
df.head()

| | Review_ID | Rating | Year_Month | Reviewer_Location | Review_Text | Branch |
|---|---|---|---|---|---|---|
| 0 | 670772142 | 4 | 2019-4 | Australia | If you've ever been to Disneyland anywhere you... | Disneyland_HongKong |
| 1 | 670682799 | 4 | 2019-5 | Philippines | Its been a while since d last time we visit HK... | Disneyland_HongKong |
| 2 | 670623270 | 4 | 2019-4 | United Arab Emirates | Thanks God it wasn t too hot or too humid wh... | Disneyland_HongKong |
| 3 | 670607911 | 4 | 2019-4 | Australia | HK Disneyland is a great compact park. Unfortu... | Disneyland_HongKong |
| 4 | 670607296 | 4 | 2019-4 | United Kingdom | the location is not in the city, took around 1... | Disneyland_HongKong |
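
The encoding choice above can be illustrated with a small standard-library sketch. It does not use `chardet`; instead, the hypothetical helper `first_working_encoding` simply tries candidate encodings in order and returns the first one that decodes without error. The sample bytes are invented for illustration. Note that ISO-8859-1 accepts every byte sequence, so it acts as a fallback here rather than as true detection.

```python
from typing import Iterable, Optional


def first_working_encoding(raw: bytes, candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # This candidate cannot represent the bytes; try the next one.
        return name
    return None


# Hypothetical sample: an accented character stored as ISO-8859-1 is not valid UTF-8.
sample = "Lovely café near the castle".encode("ISO-8859-1")
detected = first_working_encoding(sample, ["utf-8", "ISO-8859-1"])
print(detected)
```

If reading the CSV with the default UTF-8 encoding raises a `UnicodeDecodeError`, a check like this shows why an explicit `encoding` argument is needed.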


### Data Sampling
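For efficiency, only 100 randomly selected reviews are used, drawn with `pandas.DataFrame.sample()` and a fixed `random_state` for reproducibility. The sample is then split into 10 batches with `numpy.array_split()`, so each LLM call receives 10 reviews. Small batches help keep each request within the model's token limits and make iterative experimentation faster.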

In [None]:
import numpy as np

df_sampled = df[['Review_ID', 'Review_Text']].sample(n=100, random_state=42)
partition = np.array_split(df_sampled, 10)
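
To make the batch sizes concrete, here is a minimal pure-Python sketch of the splitting rule that `numpy.array_split` follows: when the length is not divisible by the number of parts, the first few parts receive one extra item. The helper name `split_evenly` is an assumption for illustration and is not part of the notebook.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(items: Sequence[T], n_parts: int) -> List[List[T]]:
    """Split `items` into `n_parts` consecutive chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    parts: List[List[T]] = []
    start = 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        parts.append(list(items[start:start + size]))
        start += size
    return parts


batches = split_evenly(list(range(100)), 10)
print([len(b) for b in batches])   # ten batches of ten items

uneven = split_evenly(list(range(10)), 3)
print([len(b) for b in uneven])    # sizes 4, 3, 3
```

With 100 sampled rows and 10 parts, every batch holds exactly 10 reviews, which is the case used in this notebook.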

## LLM Initialization <a id="llm-initialization"></a>
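In this section, we initialize **Google Gemini** through the LangChain interface. The cell reads the API key from the `GOOGLE_API_KEY` environment variable and, if it is not set, falls back to the Colab secret named `gemini_api`, so make sure one of the two is configured.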

In [None]:
from langchain.chat_models import init_chat_model
from google.colab import userdata
import os

# Set the API key from Colab secrets if it is not already in the environment
if not os.environ.get("GOOGLE_API_KEY"):
    api_token = userdata.get('gemini_api')
    os.environ["GOOGLE_API_KEY"] = api_token

# Model setup
llm = init_chat_model(
    model="gemini-2.5-flash-lite-preview-06-17",
    model_provider="google_genai",
    temperature=0.1, # Low temperature to be deterministic, but we also need a little creativity for the "summary" column
    max_tokens=3000
)
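
The credential lookup above can also be written as a small helper that fails loudly when no key is available. This is only a sketch: `resolve_api_key` and the `DEMO_API_KEY` variable are hypothetical names, and in the notebook the fallback would presumably be something like `lambda: userdata.get('gemini_api')`.

```python
import os
from typing import Callable, Optional


def resolve_api_key(env_var: str, fallback: Callable[[], Optional[str]]) -> str:
    """Return the key from the environment, or from `fallback()` if the variable is unset."""
    key = os.environ.get(env_var) or fallback()
    if not key:
        # Raise early instead of letting the first model call fail with a less obvious error.
        raise RuntimeError(f"No API key found for {env_var}")
    os.environ[env_var] = key  # Make the key visible to libraries that read the environment.
    return key


# Demonstration with a dummy variable so that no real credential is touched.
os.environ.pop("DEMO_API_KEY", None)
print(resolve_api_key("DEMO_API_KEY", lambda: "dummy-key"))
```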

## Structured Output Schema Definition <a id="output-structure-definition"></a>
Using **Pydantic**, we define a `ReviewAnalysis` model for a single review and a `ReviewBatch` wrapper for a list of them, so that the LLM produces output in a format that is easy to parse and analyze further.

In [None]:
from enum import Enum
from typing import List
from pydantic import (
    BaseModel,
    Field,
    conlist,
    confloat,
    constr,
)

# ----------------------------------------------------------------------- #
# Enumerations                                                            #
# ----------------------------------------------------------------------- #

class Sentiment(str, Enum):
    """Overall polarity of the review text."""
    positive = "positive"
    neutral  = "neutral"
    negative = "negative"


class PriorityLevel(str, Enum):
    """Business urgency implied by the review."""
    low    = "low"
    medium = "medium"
    high   = "high"


class Emotion(str, Enum):
    """Dominant emotion expressed in the review text."""
    joy      = "joy"
    anger    = "anger"
    sadness  = "sadness"
    fear     = "fear"
    surprise = "surprise"
    neutral  = "neutral"


# ----------------------------------------------------------------------- #
# Core Model: one object per review                                       #
# ----------------------------------------------------------------------- #

class ReviewAnalysis(BaseModel):
    """
    Structured representation of a single customer review about **recreation areas** (parks, trails, beaches, campgrounds, theme parks, playgrounds).

    Attributes
    ----------
    review_id : str
        Original row identifier from the source CSV.
    summary : str
        One-sentence abstract capturing the essence of the review.
    priority_level : PriorityLevel
        Business urgency – escalate `"high"` items first (e.g. safety hazards).
    sentiment : Sentiment
        High-level polarity of the review (`positive`, `neutral`, `negative`).
    sentiment_score : float
        Polarity score in the range **−1.00 → 1.00** (rounded to 2 decimals).
    dominant_emotion : Emotion
        Primary emotion inferred from the text.
    tags : List[str]
        **2–5** salient keywords extracted verbatim or lightly normalised.
    topics : List[str]
        Review aspects covered (e.g. `"cleanliness"`, `"facilities"`).
    language : str
        Two-letter ISO-639-1 code of the detected language (lower-case).
    """

    review_id: str = Field(
        ...,
        description="Original row ID from the dataset."
    )
    summary: str = Field(
        ...,
        description="Concise one-sentence summary of the review."
    )
    priority_level: PriorityLevel = Field(
        ...,
        description="Urgency level inferred from the issues raised."
    )
    sentiment: Sentiment = Field(
        ...,
        description="Overall sentiment classification."
    )
    sentiment_score: confloat(ge=-1.0, le=1.0) = Field(
        ...,
        description="Float between -1.00 and 1.00, rounded to two decimals."
    )
    dominant_emotion: Emotion = Field(
        ...,
        description="Primary emotion conveyed in the review."
    )
    tags: conlist(str, min_length=2, max_length=5) = Field(
        ...,
        description="List of 2–5 keywords highlighting key points."
    )
    topics: conlist(str, min_length=1) = Field(
        ...,
        description="Main aspect(s) discussed (cleanliness, facilities, etc.)."
    )
    language: constr() = Field(
        ...,
        description="ISO-639-1 language code (e.g. en, id, fr)."
    )

# ----------------------------------------------------------------------- #
# Batch wrapper: preserves row order when parsing many reviews at once    #
# ----------------------------------------------------------------------- #

class ReviewBatch(BaseModel):
    """
    Wrapper for a list of ordered `ReviewAnalysis` entries.

    The outer wrapper allows LangChain’s `with_structured_output()` to
    parse multiple rows in **one** LLM call while enforcing global rules:
    the sequence of objects must match the original input order.
    """
    reviews: List[ReviewAnalysis] = Field(
        ...,
        description="Ordered list of per-review analyses."
    )


# ----------------------------------------------------------------------- #
# Integration with LangChain                                              #
# ----------------------------------------------------------------------- #

structured_llm = llm.with_structured_output(ReviewBatch)
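
The `ReviewBatch` docstring asks for the output order to match the input order, but a schema alone cannot enforce that. A minimal sketch of a post-hoc check is shown below; `find_order_mismatches` is a hypothetical helper, and the ID lists are illustrative values. In the notebook, the expected IDs would presumably come from the `Review_ID` column of a batch (cast to `str`) and the returned IDs from `review.review_id`.

```python
from typing import List, Sequence, Tuple


def find_order_mismatches(
    expected_ids: Sequence[str], returned_ids: Sequence[str]
) -> List[Tuple[int, str, str]]:
    """Return (position, expected, returned) for each position where the two ID lists differ."""
    mismatches = []
    for pos in range(max(len(expected_ids), len(returned_ids))):
        # A list that is too short is reported with a "<missing>" placeholder.
        exp = expected_ids[pos] if pos < len(expected_ids) else "<missing>"
        got = returned_ids[pos] if pos < len(returned_ids) else "<missing>"
        if exp != got:
            mismatches.append((pos, exp, got))
    return mismatches


expected = ["540713188", "119781124", "576395715"]
returned = ["540713188", "576395715", "119781124"]  # last two entries swapped
print(find_order_mismatches(expected, returned))
```

An empty result means the model returned one analysis per input row in the original order.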

## Preparing Input for LLM <a id="input-preparation"></a>
Each DataFrame batch is converted to a CSV string so that it can be passed to the LLM as input.

### DataFrame to CSV Conversion Function
The following function converts a pandas DataFrame to a CSV string.

In [None]:
import csv
import io

def df_to_prompt_input(df: pd.DataFrame) -> str:
    """Convert a DataFrame to a plain CSV string that can be passed to the LLM as input.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to convert.

    Returns
    -------
    str
        The CSV representation of the DataFrame (header row included, no index column, no trailing newline).
    """
    csv_buffer = io.StringIO()
    # Use pandas to handle CSV serialization, preserving column order and quoting as needed.
    df.to_csv(csv_buffer, index=False, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    csv_text = csv_buffer.getvalue().rstrip("\n")  # Remove any trailing newline for cleanliness.
    return csv_text
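
To show what `QUOTE_MINIMAL` does to review text, here is a standard-library sketch that writes a few invented rows with the same quoting and line-terminator settings. The helper `rows_to_csv` and the sample rows are assumptions for illustration; the notebook itself relies on `DataFrame.to_csv`.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize a header and rows to CSV text, quoting only the fields that need it."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().rstrip("\n")  # Drop the trailing newline, as in the function above.


csv_text = rows_to_csv(
    ["Review_ID", "Review_Text"],
    [
        [1, "Great park, loved it"],   # contains a comma, so the field is quoted
        [2, 'He said "wow"'],          # contains quotes, so they are doubled
        [3, "Plain text"],             # needs no quoting
    ],
)
print(csv_text)
```

Because reviews often contain commas and quotation marks, this quoting is what keeps each review in a single, unambiguous CSV field.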

## LLM Execution & Results Consolidation <a id="execution-of-llm-and-data-consolidation"></a>
Each _batch_ is sent to the LLM in turn, and the responses are collected in the list `all_results`. After all batches are processed, the per-review records are flattened into a final DataFrame for further analysis.

In [None]:
from tqdm.auto import tqdm


# ------- Loop 1: Invoke LLM -------
all_results = []
part_bar = tqdm(
    partition,
    desc="⏳ Processing batches",
    total=len(partition),
    bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} • {elapsed}<{remaining} @ {rate_fmt}{postfix}",
    colour="#FFB200",            # Yellow
    dynamic_ncols=True,          # The width of the bar adjusts to the window
)

for part in part_bar:
    input_text = df_to_prompt_input(part)
    result = structured_llm.invoke(input_text)
    all_results.append(result)

    # For additional info at the bar
    part_bar.set_postfix({'latest_id': part.index[0]})

# ------- Loop 2: Parsing the output -------
processed_data = []
result_bar = tqdm(
    all_results,
    desc="🔄 Parsing results",
    bar_format="{l_bar}{bar}| {percentage:3.0f}% • {n_fmt}/{total_fmt} • {elapsed} • ETA {remaining}",
    colour="#006BFF",            # Blue
    dynamic_ncols=True,
)

for result in result_bar:
    for review in result.reviews:
        processed_data.append({
            "review_id":       review.review_id,
            "summary":         review.summary,
            "priority_level":  review.priority_level.value,
            "sentiment":       review.sentiment.value,
            "sentiment_score": review.sentiment_score,
            "dominant_emotion":review.dominant_emotion.value,
            "tags":            review.tags,
            "topics":          review.topics,
            "language":        review.language,
        })

# ------- Final DataFrame -------
df_processed = pd.DataFrame(processed_data)
display(df_processed.head())


| | review_id | summary | priority_level | sentiment | sentiment_score | dominant_emotion | tags | topics | language |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540713188 | Disneyland is a beautiful and large park that ... | low | positive | 0.9 | joy | [beautiful, large, crowds, organized, holidays] | [park, crowds, organization] | en |
| 1 | 119781124 | The long wait times for rides at Disneyland ar... | high | negative | -0.7 | anger | [lines, long, fast passes, ridiculous, worse] | [rides, wait times, crowds] | en |
| 2 | 576395715 | Enjoyed Hong Kong Disneyland, finding it a hap... | low | positive | 0.8 | joy | [loved, smaller, no queues, happy atmosphere, ... | [park, queues, atmosphere] | en |
| 3 | 310041955 | Disneyland is a beloved park that evokes nosta... | low | positive | 0.9 | joy | [love, annual pass, nostalgia, better, history] | [park, nostalgia, comparison] | en |
| 4 | 184009554 | California Adventure Park has improved with Ca... | medium | positive | 0.7 | neutral | [improved, Carsland, match, off season, short ... | [park, improvements, wait times, weather] | en |
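
The execution loop above stops at the first failed request, which discards the progress made on earlier batches. One possible safeguard, sketched here under the assumption that failures are transient (rate limits, timeouts), is to wrap each call in a retry with exponential backoff. `call_with_retry` and the `flaky` stand-in are hypothetical; in the notebook the wrapped call would presumably be `lambda: structured_llm.invoke(input_text)`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call `fn`, retrying on any exception with exponentially growing pauses."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


# Stand-in for an LLM call that fails twice before succeeding.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


outcome = call_with_retry(flaky, attempts=3, base_delay=0.0)
print(outcome, calls["count"])
```

Catching every exception is a deliberate simplification; a production version would likely retry only on the specific error types raised by the client library.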
