# Review Classification

### Key Takeaways!

- How to instantiate an LLM with LangChain and OpenAI.
- How to create a chain with LangChain.
- How to create a Pydantic object with an LLM.

In [None]:
import ast
import os

import pandas as pd

from typing import List
from dotenv import load_dotenv
from pydantic import BaseModel, Field, field_validator

from langchain_openai import AzureChatOpenAI
from langchain_core.runnables import RunnableLambda
from langchain_core.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser


In [73]:
# Load environment variables (such as the API key) from a local .env file
load_dotenv()

df = pd.read_json("data/simple_reviews.json")
df.head()

Unnamed: 0,date,movie,review
0,2025-09-01,Inception,Amazing visuals and mind-bending plot.
1,2025-09-02,Titanic,The pacing felt too slow.
2,2025-09-03,The Matrix,Innovative action sequences.
3,2025-09-04,Avatar,The story was predictable.
4,2025-09-05,Interstellar,Stunning cinematography.


In [76]:
llm = AzureChatOpenAI(
    azure_deployment="gpt-4.1",
    openai_api_type="azure",
    api_version="2024-12-01-preview",
    azure_endpoint="https://i4talent-openai.openai.azure.com",
    api_key=os.getenv("OPENAI_API_KEY"),
    temperature=0
)


In [None]:

def my_custom_logic(x):
    # Upper-case the output; handles message objects as well as plain values
    return x.content.upper() if hasattr(x, "content") else str(x).upper()


# Wrap the function so it can be used as a step in a chain
custom_step = RunnableLambda(my_custom_logic)

# Define prompts
prompt1 = PromptTemplate.from_template("Perform the following instruction within 5 sentences: {input}")
prompt2 = PromptTemplate.from_template("Translate to DUTCH: {input}")

# Chain: the output of each step is the input of the next
chain = (
    prompt1
    | llm
    | custom_step
    | prompt2
    | llm
)

# Run it
result = chain.invoke({"input": "Please explain the LangChain chain operator and the runnables module!"})
print(result.content)

DE LANGCHAIN **CHAIN OPERATOR** IS EEN MECHANISME DAT HET MOGELIJK MAAKT OM MEERDERE COMPONENTEN (ZOALS PROMPTS, MODELLEN OF TOOLS) AAN ELKAAR TE KOPPELEN TOT EEN SEQUENTIE, WAARBIJ DE UITVOER VAN DE ENE STAP DE INVOER VOOR DE VOLGENDE WORDT. DIT MAAKT HET MOGELIJK OM COMPLEXE WORKFLOWS, ZOALS MEERSTAPS-REASONERING OF GEGEVENSVERWERKING, TE BOUWEN UIT EENVOUDIGE, HERGEBRUIKBARE ONDERDELEN. DE **RUNNABLES MODULE** IN LANGCHAIN BIEDT EEN GESTANDAARDISEERDE INTERFACE VOOR DEZE COMPONENTEN, WAARDOOR HET GEMAKKELIJK IS OM KETENS TE DEFINIËREN, SAMEN TE STELLEN EN UIT TE VOEREN. RUNNABLES KUNNEN ALLES VOORSTELLEN DAT EEN INVOER NEEMT EN EEN UITVOER GEEFT, ZOALS TAALMODELLEN, FUNCTIES OF ZELFS VOLLEDIGE KETENS. DOOR DE CHAIN OPERATOR EN RUNNABLES SAMEN TE GEBRUIKEN, KUNNEN ONTWIKKELAARS FLEXIBEL GEAVANCEERDE TAALGEBASEERDE TOEPASSINGEN BOUWEN EN BEHEREN.
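
To make the pipe operator less magical, here is a toy sketch in plain Python of how `|` can compose steps. It is not LangChain's actual implementation: the `Step` class, `fill_prompt`, and `fake_llm` are hypothetical names made up for this illustration, and no API call is made.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Accept plain callables too, so a bare function can sit in a chain
        nxt = other if isinstance(other, Step) else Step(other)
        # The new step feeds this step's output into the next step
        return Step(lambda value: nxt.invoke(self.invoke(value)))


# Hypothetical stand-ins for a prompt template and a model
fill_prompt = Step(lambda d: f"Instruction: {d['input']}")
fake_llm = Step(lambda prompt: f"answer to <{prompt}>")

toy_chain = fill_prompt | fake_llm | str.upper
toy_result = toy_chain.invoke({"input": "explain pipes"})
print(toy_result)  # ANSWER TO <INSTRUCTION: EXPLAIN PIPES>
```

The real runnables add much more (batching, streaming, async), but the core idea of feeding one step's output into the next is the same.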


## Case Study:

You are hired to label these movie reviews. The owner of the cinema would like you to apply three categories:
- Good
- Neutral
- Bad

To achieve this goal we will leverage a pre-trained LLM.

## Chain and Runnables

In [None]:
def get_category(llm, review: str, valid_categories: List[str]) -> str:
    categories_str = ", ".join(valid_categories)
    prompt = (
        f"Categorize the following movie review as one of the following: {categories_str}. "
        "Only return the category name as a string.\n"
        f"Review: {review}"
    )
    response = llm.invoke(prompt)
    classification = response.content.strip()
    return classification if classification in valid_categories else "Unknown"

def categorize_reviews(df: pd.DataFrame, llm, valid_categories: List[str], review_col: str) -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    df['Category'] = df[review_col].apply(
        lambda review: get_category(llm, review, valid_categories)
    )
    return df

VALID_CATEGORIES = ["Bad", "Neutral", "Good"]
df_categorized = (
    df
    .pipe(categorize_reviews, llm=llm, valid_categories=VALID_CATEGORIES, review_col="review")
)
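
`get_category` only accepts an exact match, so an answer such as `good.` or `'Neutral'` would be labelled "Unknown". The sketch below shows one possible way to make the check more forgiving. `normalize_category` is a hypothetical helper, not part of the notebook or of LangChain, and the set of characters it strips is an assumption about typical model output.

```python
from typing import List


def normalize_category(raw: str, valid_categories: List[str]) -> str:
    """Map a raw model answer onto a valid category, ignoring case and stray punctuation."""
    # Remove surrounding whitespace, then quotes, periods, exclamation marks and backticks
    cleaned = raw.strip().strip("\"'.!` ").casefold()
    # Case-insensitive lookup that returns the canonical spelling
    lookup = {category.casefold(): category for category in valid_categories}
    return lookup.get(cleaned, "Unknown")


categories = ["Bad", "Neutral", "Good"]
print(normalize_category(" good.\n", categories))   # Good
print(normalize_category("'Neutral'", categories))  # Neutral
print(normalize_category("Mixed", categories))      # Unknown
```

In `get_category`, the last line could then return `normalize_category(response.content, valid_categories)` instead of doing the exact comparison.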

## Complex Reviews

We don't only have simple reviews, but also more complex ones that need more processing!

In [None]:

df_complex = pd.read_json("data/complex_reviews.json")
df_complex.index.name = "review_id"
df_complex = df_complex.reset_index()
df_complex.head()

Unnamed: 0,review_id,date,movie,review
0,0,2025-09-01,Inception,"Visually stunning and intellectually rich, tho..."
1,1,2025-09-02,Titanic,Emotional but slow-paced. Theater decor matche...
2,2,2025-09-03,The Matrix,Groundbreaking action and ideas. Laser project...
3,3,2025-09-04,Avatar,"Beautiful visuals, predictable story. Jungle-t..."


### Abstract Syntax Tree

In [None]:
response = llm.invoke("Generate a python list with 5 elements. Do not show anything else")
unprocessed_list = response.content
# literal_eval safely turns the string into a Python object without executing code
processed_list = ast.literal_eval(unprocessed_list)
print(f"unprocessed_list: {unprocessed_list}, type: {type(unprocessed_list)}")
print(f"processed_list: {processed_list}, type: {type(processed_list)}")

print(ast.dump(ast.parse(unprocessed_list), indent=4))

unprocessed_list: [1, 2, 3, 4, 5], type: <class 'str'>
processed_list: [1, 2, 3, 4, 5], type: <class 'list'>
Module(
    body=[
        Expr(
            value=List(
                elts=[
                    Constant(value=1),
                    Constant(value=2),
                    Constant(value=3),
                    Constant(value=4),
                    Constant(value=5)],
                ctx=Load()))],
    type_ignores=[])
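
`ast.literal_eval` raises an error when the model adds anything around the list, for example a Markdown code fence or a sentence of explanation. The following is a minimal sketch of a more defensive parser, assuming that the two common failure modes are a fenced answer and free text. `parse_llm_list` is a hypothetical helper name, and returning an empty list on failure is a design choice for this illustration.

```python
import ast
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:\w+)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_llm_list(text: str) -> list:
    """Parse a Python list literal from model output, or return [] if that fails."""
    # If the answer is wrapped in a code fence, keep only what is inside it
    match = FENCED_BLOCK.search(text)
    candidate = match.group(1) if match else text
    try:
        value = ast.literal_eval(candidate.strip())
    except (ValueError, SyntaxError):
        return []
    # Anything that is not a list or tuple is treated as a failed parse
    return list(value) if isinstance(value, (list, tuple)) else []


fenced_answer = FENCE + "python\n['plot', 'seats']\n" + FENCE
print(parse_llm_list("[1, 2, 3]"))           # [1, 2, 3]
print(parse_llm_list(fenced_answer))         # ['plot', 'seats']
print(parse_llm_list("Sure! Here you go."))  # []
```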


In [None]:
def split_review_into_topics(llm, review: str) -> list:
    """
    Uses the LLM to split a review into related topics.
    Returns a list of topic strings, e.g. ['broken recliner', 'popcorn stale'].
    """
    prompt = (
        "Break the following movie review into its related topics or aspects. "
        "Return each topic as a separate string in a Python list. "
        "Do not summarize, just split.\n"
        f"Review: {review}"
    )
    response = llm.invoke(prompt)
    topics = ast.literal_eval(response.content)
    return topics

def drop_col(df: pd.DataFrame, col: str) -> pd.DataFrame:
    return df.drop(columns=[col])

def add_topics_column(df: pd.DataFrame, llm) -> pd.DataFrame:
    """
    Adds a 'topics' column to the DataFrame by splitting each review into topics using the LLM,
    then explodes it so that every topic gets its own row.
    """
    df = df.copy()  # avoid modifying the caller's DataFrame in place
    df['topics'] = df['review'].apply(lambda review: split_review_into_topics(llm, review))
    df = df.explode('topics').reset_index(drop=True)
    return df

df_processed = (
    df_complex
    .pipe(add_topics_column, llm=llm)
    .pipe(drop_col, "review")
    .pipe(categorize_reviews, llm=llm, valid_categories=VALID_CATEGORIES, review_col="topics")
)
df_processed.head()

Unnamed: 0,review_id,date,movie,topics,Category
0,0,2025-09-01,Inception,Visually stunning and intellectually rich,Good
1,0,2025-09-01,Inception,broken recliner,Bad
2,0,2025-09-01,Inception,Staff were helpful,Neutral
3,0,2025-09-01,Inception,popcorn stale,Bad
4,1,2025-09-02,Titanic,Emotional but slow-paced,Neutral


In [None]:
class Ticket(BaseModel):
    employee: str = Field(..., description="The name of the employee assigned to the ticket")
    topic: str = Field(..., description="The topic or aspect of the review that needs attention")
    suggested_improvement: str = Field(..., description="The suggested improvement for the topic")


parser = PydanticOutputParser(pydantic_object=Ticket)
format_instructions = parser.get_format_instructions()
print(format_instructions)

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"employee": {"description": "The name of the employee assigned to the ticket", "title": "Employee", "type": "string"}, "topic": {"description": "The topic or aspect of the review that needs attention", "title": "Topic", "type": "string"}, "suggested_improvement": {"description": "The suggested improvement for the topic", "title": "Suggested Improvement", "type": "string"}}, "required": ["employee", "topic", "suggested_improvement"]}
```


In [None]:
# pandas, typing, pydantic and PydanticOutputParser are already imported in the first cell
from enum import Enum

class Department(str, Enum):
    """Departments in movie Theater"""
    FACILITIES = "facilities"
    FOOD = "food"
    MOVIE_PROGRAM = "movie program"


class Employee(BaseModel):
    """Employee in movie theater"""
    name: str
    department: Department


class Staff(list):
    def __init__(self, employees):
        super().__init__(employees)
        self._lookup = {e.name: e for e in employees}

    def is_valid_employee(self, name: str) -> bool:
        return name in self._lookup


staff = Staff([
    Employee(name="Alice", department=Department.FACILITIES),
    Employee(name="Bob", department=Department.FOOD),
    Employee(name="Chris", department=Department.MOVIE_PROGRAM)
])


class Ticket(BaseModel):
    """Ticket for a suggested improvement"""
    employee_name: str = Field(..., description="The name of the employee assigned to the ticket")
    topic: str = Field(..., description="The topic or aspect of the review that needs attention")
    suggested_improvement: str = Field(..., description="The suggested improvement for the topic")

    @field_validator("employee_name")
    @classmethod
    def validate_employee(cls, v):
        # Reject names the model made up: only known staff members are allowed
        if not staff.is_valid_employee(v):
            raise ValueError(f"Employee '{v}' is not in the staff list.")
        return v
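
The validator above raises when the model names someone who is not on the staff list, which would stop the whole pipeline. One possible way to deal with that is to retry the call a few times. The sketch below assumes the failure surfaces as a `ValueError` (pydantic's `ValidationError` is a `ValueError` subclass); adjust the exception type if your parser raises something else. `invoke_with_retry`, `fake_generate`, and `KNOWN_STAFF` are hypothetical names, and the model is replaced by a stub so the example runs without an API key.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def invoke_with_retry(call: Callable[[], T], max_attempts: int = 3) -> T:
    """Run `call` until it succeeds or `max_attempts` is reached."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return call()
        except ValueError as err:
            # Remember the failure and try again
            last_error = err
    raise RuntimeError(f"No valid result after {max_attempts} attempts") from last_error


KNOWN_STAFF = {"Alice", "Bob", "Chris"}   # mirrors the staff list above
answers = iter(["Dave", "Eve", "Alice"])  # made-up model answers for the stub


def fake_generate() -> str:
    # Stub for an LLM call followed by validation
    name = next(answers)
    if name not in KNOWN_STAFF:
        raise ValueError(f"Employee '{name}' is not in the staff list.")
    return name


assigned = invoke_with_retry(fake_generate)
print(assigned)  # Alice
```

With the real pipeline, the wrapped call could be something like `lambda: generate_ticket(llm, topic, staff)`.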


In [None]:

def generate_ticket(llm, topic: str, staff: List[Employee]) -> Ticket:
    parser = PydanticOutputParser(pydantic_object=Ticket)
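    # Use .value so the prompt shows "facilities" rather than "Department.FACILITIES" on newer Python versions
    staff_info = "\n".join([f"- {e.name} ({e.department.value})" for e in staff])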
    format_instructions = parser.get_format_instructions()
    prompt_template = PromptTemplate.from_template(
        "You are assigning a ticket based on the following topic:\n"
        "Topic: {topic}\n\n"
        "The available staff are:\n{staff_info}\n\n"
        "{format_instructions}\n"
        "Choose the most suitable employee and suggest an improvement."
    )
    chain = prompt_template | llm | parser
    response = chain.invoke({
        "topic": topic,
        "staff_info": staff_info,
        "format_instructions": format_instructions
    })
    return response

def create_tickets(df: pd.DataFrame, llm, staff: List[Employee]) -> List[Ticket]:
    tickets = []
    for _, row in df.iterrows():
        topic = row['topics']
        ticket = generate_ticket(llm, topic, staff)
        tickets.append(ticket)
    return tickets

def process_reviews_and_generate_tickets(df: pd.DataFrame, llm, staff: List[Employee]) -> pd.DataFrame:
    df_final = (
        df
        .pipe(add_topics_column, llm=llm)
        .pipe(drop_col, "review")
        .pipe(categorize_reviews, llm=llm, valid_categories=VALID_CATEGORIES, review_col="topics")
    )
    tickets = create_tickets(df_final, llm=llm, staff=staff)
    return pd.DataFrame([t.model_dump() for t in tickets])

df_result = process_reviews_and_generate_tickets(df_complex, llm, staff)
df_result.head()

Unnamed: 0,employee_name,topic,suggested_improvement
0,Chris,Visually stunning and intellectually rich,Enhance the selection of films to include more...
1,Alice,broken recliner,Inspect and repair or replace the broken recli...
2,Alice,Staff were helpful,Implement a recognition program to further enc...
3,Bob,popcorn stale,Implement a freshness monitoring system for po...
4,Chris,Emotional but slow-paced,Consider incorporating more dynamic elements o...


## Task

Alice and Bob were really excited about the new ticket system, but wanted to know which tickets are quick wins. Chris feels the ticket format is not suitable for him. He requested an alternative format in which he gets a summary covering the following topics:
- Overall Story
- Actor Performance
- Special Effects

Improve and/or extend the pipeline so we can construct the right summary for Chris.


