# TNT LLM - Text mining at scale

This notebook walks step by step through the code presented in this [documentation article](https://langchain-ai.github.io/langgraph/tutorials/tnt-llm/tnt-llm/#3a-taxonomy-generation-utilities).

The LLM workflow is composed of several steps that build a **taxonomy** from a corpus of unlabeled free text.
    



In [1]:
# Prepare requirements 

In [2]:
%pip install -U langgraph langchain_anthropic langsmith langchain-community
%pip install -U scikit-learn langchain_openai pandas

Note: you may need to restart the kernel to use updated packages.

In [3]:
import logging
import operator
from typing import Annotated, List, Optional
from typing_extensions import TypedDict

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("tnt-llm")


class Doc(TypedDict):
    id: str
    content: str
    summary: Optional[str]
    explanation: Optional[str]
    category: Optional[str]


class TaxonomyGenerationState(TypedDict):
    # The raw docs; we inject summaries within them in the first step
    documents: List[Doc]
    # Indices to be concise
    minibatches: List[List[int]]
    # Candidate Taxonomies (full trajectory)
    clusters: Annotated[List[List[dict]], operator.add]
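The `operator.add` annotation on `clusters` declares a reducer: when this state is used inside a LangGraph graph, the list a node returns under `clusters` is concatenated to the list already in the state instead of replacing it. The minimal sketch below imitates that merge by hand in plain Python; `merge_clusters` and the sample taxonomies are illustrative names, not part of LangGraph.

```python
import operator
from typing import List


def merge_clusters(
    current: List[List[dict]], update: List[List[dict]]
) -> List[List[dict]]:
    """Imitate the reducer: concatenate new candidate taxonomies to the history."""
    return operator.add(current, update)


# Hypothetical taxonomies, only for illustration
history = [[{"id": "1", "name": "Taste"}]]
node_output = [[{"id": "1", "name": "Taste"}, {"id": "2", "name": "Price"}]]

history = merge_clusters(history, node_output)
print(len(history))  # number of candidate taxonomies kept so far
print(history[-1])   # the most recent taxonomy, which the next step reads
```

Outside a graph no reducer runs, so any code that calls the node functions directly has to build and extend the `clusters` list itself.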

In [4]:
import os
# Placeholder value: replace it with your own key (and avoid committing a real key)
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-api"

In [5]:
MODEL_NAME = "claude-3-5-haiku-20241022"

## Step 0 - Dataset to experiment with [Amazon food reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews)

In [6]:
import pandas as pd
df = pd.read_csv("../../data/amazon-fine-food-reviews/Reviews.csv")
df

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
...,...,...,...,...,...,...,...,...,...,...
568449,568450,B001EO7N10,A28KG5XORO54AY,Lettie D. Carter,0,0,5,1299628800,Will not do without,Great for sesame chicken..this is a good if no...
568450,568451,B003S1WTCU,A3I8AFVPEE8KI5,R. Sawyer,0,0,2,1331251200,disappointed,I'm disappointed with the flavor. The chocolat...
568451,568452,B004I613EE,A121AA1GQV751Z,"pksd ""pk_007""",2,2,5,1329782400,Perfect for our maltipoo,"These stars are small, so you can give 10-15 o..."
568452,568453,B004I613EE,A3IBEVCTXKNOH,"Kathy A. Welch ""katwel""",1,1,5,1331596800,Favorite Training and reward treat,These are the BEST treats for training and rew...


In [7]:
def create_docs(df: pd.DataFrame):
    for x in df.itertuples():
        yield {"id": x.Id, "summary": x.Summary, "content": x.Text}
docs = list(create_docs(df))
docs[0]

{'id': 1,
 'summary': 'Good Quality Dog Food',
 'content': 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.'}

## Step 1 - Build summaries of the documents

In TNT LLM, each document is first summarized so that a batch of documents fits in the context window.

In [9]:
import re

from langchain import hub
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableConfig, RunnableLambda, RunnablePassthrough

summary_prompt = hub.pull("wfh/tnt-llm-summary-generation").partial(
    summary_length=20, explanation_length=30
)


def get_content(state: TaxonomyGenerationState):
    docs = state["documents"]
    return [{"content": doc["content"]} for doc in docs]


def parse_summary(xml_string: str) -> dict:
    summary_pattern = r"<summary>(.*?)</summary>"
    explanation_pattern = r"<explanation>(.*?)</explanation>"

    summary_match = re.search(summary_pattern, xml_string, re.DOTALL)
    explanation_match = re.search(explanation_pattern, xml_string, re.DOTALL)

    summary = summary_match.group(1).strip() if summary_match else ""
    explanation = explanation_match.group(1).strip() if explanation_match else ""

    return {"summary": summary, "explanation": explanation}


def reduce_summaries(combined: dict) -> TaxonomyGenerationState:
    summaries = combined["summaries"]
    documents = combined["documents"]
    return {
        "documents": [
            {
                "id": doc["id"],
                "content": doc["content"],
                "summary": summ_info["summary"],
                "explanation": summ_info["explanation"],
            }
            for doc, summ_info in zip(documents, summaries)
        ]
    }


# Now combine as a "map" operation in a map-reduce chain
# Input: state
# Output: state U summaries
# Processes docs in parallel
summary_llm_chain = (
    summary_prompt | ChatAnthropic(model=MODEL_NAME) | StrOutputParser()
    # Customize the tracing name for easier organization
).with_config(run_name="GenerateSummary")


summary_chain = summary_llm_chain | parse_summary

# This effectively creates a "map" operation
# Note you can make this more robust by handling individual errors
map_step = RunnablePassthrough.assign(
    summaries=get_content | RunnableLambda(func=summary_chain.batch, afunc=summary_chain.abatch)
)


# This is actually the node itself!
map_reduce_chain = map_step | reduce_summaries
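To see what `parse_summary` does without calling the model, the sketch below repeats the function so that it runs on its own and applies it to a hand-written response. The sample strings are made up for illustration.

```python
import re


def parse_summary(xml_string: str) -> dict:
    """Same logic as the notebook's parser, repeated so the example is standalone."""
    summary_match = re.search(r"<summary>(.*?)</summary>", xml_string, re.DOTALL)
    explanation_match = re.search(
        r"<explanation>(.*?)</explanation>", xml_string, re.DOTALL
    )
    return {
        "summary": summary_match.group(1).strip() if summary_match else "",
        "explanation": explanation_match.group(1).strip() if explanation_match else "",
    }


# Hand-written sample response (not real model output)
sample = """<summary>
Peanuts arrived small and unsalted.
</summary>
<explanation>Kept the product mismatch.</explanation>"""

parsed = parse_summary(sample)
print(parsed)

# A response without the expected tags yields empty strings instead of raising
print(parse_summary("no tags here"))
```

Because missing tags silently produce empty strings, it can be worth checking for empty summaries before moving on to taxonomy generation.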



In [10]:
# Test the document summary part
map_reduce_chain.invoke({"documents": docs[:5]})

{'documents': [{'id': 1,
   'content': 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.',
   'summary': 'Dog owner praises Vitality canned food for quality, texture, and appeal to finicky Labrador.',
   'explanation': "I extracted key aspects: product quality, food characteristics, and positive dog's reaction while staying within 20 words."},
  {'id': 2,
   'content': 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".',
   'summary': 'Customer received small unsalted peanuts instead of labeled jumbo salted peanuts, questioning product accuracy.',
   'explanation': 'I extracted key points: product mismatch, size discrepancy, salt conten

## Step 2 - Generate a first taxonomy

The taxonomy is the core of the topic modeling.

It is built incrementally, but we need an initial one to start from.




```
human

# Questions

## Q1. Please generate a cluster table from the input data that meets the requirements.

Tips

- The cluster table should be a **flat list** of **mutually exclusive** categories. Sort them based on their semantic relatedness.

- Though you should aim for {max_num_clusters} categories, you can have *fewer than {max_num_clusters} categories* in the cluster table;  but **do not exceed the limit.** 

- Be **specific** about each category. **Do not include vague categories** such as "Other", "General", "Unclear", "Miscellaneous" or "Undefined" in the cluster table.

- You can ignore low quality or ambiguous data points.

## Q2. Why did you cluster the data the way you did? Explain your reasoning **within {explanation_length} words**.

## Provide your answers between the tags: <cluster_table>your generated cluster table with no more than {max_num_clusters} categories</cluster_table>, <explanation>explanation of your reasoning process within {explanation_length} words</explanation>.

# Output
```

**We need some code to format the input and parse the output of the LLM calls**

In [11]:
from typing import Dict

from langchain_core.runnables import Runnable


def parse_taxa(output_text: str) -> Dict:
    """Extract the taxonomy from the generated output."""
    cluster_matches = re.findall(
        r"\s*<id>(.*?)</id>\s*<name>(.*?)</name>\s*<description>(.*?)</description>\s*",
        output_text,
        re.DOTALL,
    )
    clusters = [
        {"id": id.strip(), "name": name.strip(), "description": description.strip()}
        for id, name, description in cluster_matches
    ]
    # We don't parse the explanation since it isn't used downstream
    return {"clusters": clusters}



def format_docs(docs: List[Doc]) -> str:
    xml_table = "<conversations>\n"
    for doc in docs:
        xml_table += f'<conv_summ id={doc["id"]}>{doc["summary"]}</conv_summ>\n'
    xml_table += "</conversations>"
    return xml_table


def format_taxonomy(clusters):
    xml = "<cluster_table>\n"
    for label in clusters:
        xml += "  <cluster>\n"
        xml += f'    <id>{label["id"]}</id>\n'
        xml += f'    <name>{label["name"]}</name>\n'
        xml += f'    <description>{label["description"]}</description>\n'
        xml += "  </cluster>\n"
    xml += "</cluster_table>"
    return xml
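A quick offline check of these helpers is a round trip: format a taxonomy as XML, parse it back, and compare. The sketch below uses compact copies of `format_taxonomy` and `parse_taxa` so that it runs on its own; the two clusters are invented for the test.

```python
import re
from typing import Dict, List


def format_taxonomy(clusters: List[dict]) -> str:
    """Compact copy of the notebook's formatter."""
    xml = "<cluster_table>\n"
    for label in clusters:
        xml += "  <cluster>\n"
        xml += f'    <id>{label["id"]}</id>\n'
        xml += f'    <name>{label["name"]}</name>\n'
        xml += f'    <description>{label["description"]}</description>\n'
        xml += "  </cluster>\n"
    return xml + "</cluster_table>"


def parse_taxa(output_text: str) -> Dict:
    """Compact copy of the notebook's parser."""
    matches = re.findall(
        r"\s*<id>(.*?)</id>\s*<name>(.*?)</name>\s*<description>(.*?)</description>\s*",
        output_text,
        re.DOTALL,
    )
    return {
        "clusters": [
            {"id": i.strip(), "name": n.strip(), "description": d.strip()}
            for i, n, d in matches
        ]
    }


# Made-up taxonomy used only for this check
clusters = [
    {"id": "1", "name": "Taste", "description": "Reviews about flavor"},
    {"id": "2", "name": "Price", "description": "Reviews about cost"},
]

xml = format_taxonomy(clusters)
print(xml)
round_trip = parse_taxa(xml)["clusters"]
print(round_trip == clusters)
```

The parser relies on the `<id>`, `<name>`, `<description>` order, so a model response that reorders or omits one of these tags would drop that cluster.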




In [12]:
import random


# We will share an LLM for each step of the generate -> update -> review cycle
# You may want to consider using Opus or another more powerful model for this
taxonomy_generation_llm = ChatAnthropic(
    model=MODEL_NAME, max_tokens_to_sample=2000
)


## Initial generation
taxonomy_generation_prompt = hub.pull("wfh/tnt-llm-taxonomy-generation").partial(
    use_case="Generate the taxonomy that can be used to label the user intent in the conversation.",
)

taxa_gen_llm_chain = (
    taxonomy_generation_prompt | taxonomy_generation_llm | StrOutputParser()
).with_config(run_name="GenerateTaxonomy")


generate_taxonomy_chain = taxa_gen_llm_chain | parse_taxa




**We use the summaries already present in the dataset to test our function**

In [13]:
# Each doc already comes with a summary, so we skip the LLM summarization cost here
docs[0]

{'id': 1,
 'summary': 'Good Quality Dog Food',
 'content': 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.'}

In [14]:
first_taxonomy_str = taxa_gen_llm_chain.invoke(
    {
        "data_xml": format_docs(docs[:10]),
        "use_case": "Find the categories of product review",
        "cluster_table_xml": "",
        "suggestion_length": 30,
        "cluster_name_length": 10,
        "cluster_description_length": 30,
        "explanation_length": 20,
        "max_num_clusters": 25,
    }
)

In [15]:
print(first_taxonomy_str)

<cluster_table>
<clusters>
  <cluster>
    <id>1</id>
    <name>Pet Food Quality Evaluation</name>
    <description>Reviews focusing on nutritional value and overall quality of dog food products</description>
  </cluster>
  <cluster>
    <id>2</id>
    <name>Product Expectation Mismatch</name>
    <description>Reviews highlighting discrepancies between product advertising and actual product performance</description>
  </cluster>
  <cluster>
    <id>3</id>
    <name>Positive Taste Experience</name>
    <description>Reviews expressing delight and satisfaction with the flavor of food products</description>
  </cluster>
  <cluster>
    <id>4</id>
    <name>Health and Wellness Products</name>
    <description>Reviews related to medicinal or health-supporting products like cough medicine</description>
  </cluster>
</clusters>
</cluster_table>

<explanation>Clustered based on common themes: food quality, expectations, taste satisfaction, and health products.</explanation>


In [33]:
taxonomy = parse_taxa(first_taxonomy_str)
taxonomy

{'clusters': [{'id': '1',
   'name': 'Pet Food Quality Evaluation',
   'description': 'Reviews focusing on nutritional value and overall quality of dog food products'},
  {'id': '2',
   'name': 'Product Expectation Mismatch',
   'description': 'Reviews highlighting discrepancies between product advertising and actual product performance'},
  {'id': '3',
   'name': 'Positive Taste Experience',
   'description': 'Reviews expressing delight and satisfaction with the flavor of food products'},
  {'id': '4',
   'name': 'Health and Wellness Products',
   'description': 'Reviews related to medicinal or health-supporting products like cough medicine'}]}

## Step 3 - Redo it in batch mode

We prepare to scale to (almost) the full dataset.


In [95]:
def invoke_taxonomy_chain(
    chain: Runnable,
    state: TaxonomyGenerationState,
    config: RunnableConfig,
    mb_indices: List[int],
) -> TaxonomyGenerationState:

    # Read the prompt length limits from the config
    configurable = config["configurable"]

    # Get docs of the batch in the right format
    docs = state["documents"]
    minibatch = [docs[idx] for idx in mb_indices]
    data_table_xml = format_docs(minibatch)

    # Format the latest taxonomy (empty on the first call) for the prompt
    previous_taxonomy = state["clusters"][-1] if state["clusters"] else []
    cluster_table_xml = format_taxonomy(previous_taxonomy)

    updated_taxonomy = chain.invoke(
        {
            "data_xml": data_table_xml,
            "use_case": configurable["use_case"],
            "cluster_table_xml": cluster_table_xml,
            "suggestion_length": configurable.get("suggestion_length", 30),
            "cluster_name_length": configurable.get("cluster_name_length", 10),
            "cluster_description_length": configurable.get(
                "cluster_description_length", 30
            ),
            "explanation_length": configurable.get("explanation_length", 20),
            "max_num_clusters": configurable.get("max_num_clusters", 25),
        }
    )

    # A chain ending with parse_taxa returns a dict; a chain without it returns raw text
    if isinstance(updated_taxonomy, dict) and "clusters" in updated_taxonomy:
        return {
            "clusters": [updated_taxonomy["clusters"]],
        }
    else:
        return updated_taxonomy


def generate_taxonomy(
    state: TaxonomyGenerationState, config: RunnableConfig
) -> TaxonomyGenerationState:
    return invoke_taxonomy_chain(
        generate_taxonomy_chain, state, config, state["minibatches"][0]
    )



def get_minibatches(state: TaxonomyGenerationState, config: RunnableConfig):
    batch_size = config["configurable"].get("batch_size", 15)
    original = state["documents"]
    indices = list(range(len(original)))
    random.shuffle(indices)
    if len(indices) < batch_size:
        # Don't pad needlessly if we can't fill a single batch
        return {"minibatches": [indices]}

    num_full_batches = len(indices) // batch_size

    batches = [
        indices[i * batch_size : (i + 1) * batch_size] for i in range(num_full_batches)
    ]

    leftovers = len(indices) % batch_size
    if leftovers:
        last_batch = indices[num_full_batches * batch_size :]
        elements_to_add = batch_size - leftovers
        last_batch += random.sample(indices, elements_to_add)
        batches.append(last_batch)

    return {
        "minibatches": batches,
    }

In [17]:
"""
class TaxonomyGenerationState(TypedDict):
    # The raw docs; we inject summaries within them in the first step
    documents: List[Doc]
    # Indices to be concise
    minibatches: List[List[int]]
    # Candidate Taxonomies (full trajectory)
    clusters: Annotated[List[List[dict]], operator.add]
"""

my_docs = docs[:30]

configurable = {"configurable": {"use_case": "classify the food reviews into categories", "batch_size": 15}}
data = {"documents":my_docs}
batches = get_minibatches(data, configurable)

print(batches)

{'minibatches': [[5, 14, 12, 22, 1, 29, 2, 11, 4, 7, 16, 20, 19, 9, 8], [13, 18, 25, 23, 24, 17, 26, 3, 10, 21, 28, 15, 0, 27, 6]]}
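With 30 documents and a batch size of 15 the split is exact. When the document count is not a multiple of the batch size, `get_minibatches` pads the last batch by resampling from the whole set. The sketch below reproduces that logic without the LangGraph state; `make_minibatches` and its `seed` argument are hypothetical, added only so that the example is reproducible.

```python
import random
from typing import List


def make_minibatches(n_docs: int, batch_size: int, seed: int = 0) -> List[List[int]]:
    """Shuffle indices, cut them into full batches, pad the last one by resampling."""
    rng = random.Random(seed)  # local generator so the example is reproducible
    indices = list(range(n_docs))
    rng.shuffle(indices)
    if len(indices) < batch_size:
        return [indices]  # a single, shorter batch; no padding

    num_full = len(indices) // batch_size
    batches = [indices[i * batch_size : (i + 1) * batch_size] for i in range(num_full)]

    leftovers = len(indices) % batch_size
    if leftovers:
        last = indices[num_full * batch_size :]
        # Pad with documents drawn from the whole set, so some appear more than once
        last += rng.sample(indices, batch_size - leftovers)
        batches.append(last)
    return batches


batches = make_minibatches(n_docs=32, batch_size=15)
print([len(b) for b in batches])  # [15, 15, 15]
covered = {i for b in batches for i in b}
print(len(covered))               # 32: every document is in at least one batch
```

The padding keeps every prompt the same size, at the price of showing a few documents to the model twice.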


In [18]:

rez = generate_taxonomy({"documents": my_docs, "minibatches": batches["minibatches"], "clusters": []}, configurable)



In [21]:
print(rez.keys())
print("\n".join(str(x) for x in rez["clusters"]))

dict_keys(['clusters'])
[{'id': '1', 'name': 'Pet Food Preferences', 'description': 'Reviews about cat and dog food quality and palatability'}, {'id': '2', 'name': 'Sweet Candy Enjoyment', 'description': 'Positive reviews of sweet candies like taffy and Twizzlers'}, {'id': '3', 'name': 'Condiment Flavor Intensity', 'description': 'Reviews highlighting strong taste experiences with sauces'}, {'id': '4', 'name': 'Product Satisfaction Levels', 'description': 'Reviews expressing general delight or disappointment with food products'}, {'id': '5', 'name': 'Delivery and Freshness', 'description': 'Comments about product delivery and freshness quality'}]


OK: `generate_taxonomy` returns a parsed taxonomy built from the first minibatch.

## Step 4 - Update the taxonomy

In [35]:
taxonomy_update_prompt = hub.pull("wfh/tnt-llm-taxonomy-update")

taxa_update_llm_chain = (
    taxonomy_update_prompt | taxonomy_generation_llm | StrOutputParser()
).with_config(run_name="UpdateTaxonomy")


update_taxonomy_chain = taxa_update_llm_chain | parse_taxa


def update_taxonomy(
    state: TaxonomyGenerationState, config: RunnableConfig
) -> TaxonomyGenerationState:
    which_mb = len(state["clusters"]) % len(state["minibatches"])
    return invoke_taxonomy_chain(
        update_taxonomy_chain, state, config, state["minibatches"][which_mb]
    )
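`update_taxonomy` chooses its minibatch from the number of taxonomies already stored in the state, so repeated calls walk through the minibatches in order and wrap around. The sketch below shows that schedule with no LLM involved. It assumes each step appends one taxonomy to `clusters` (which the `operator.add` reducer does inside a graph); `schedule` is a hypothetical helper.

```python
from typing import List


def schedule(
    num_minibatches: int, num_steps: int, initial_taxonomies: int = 1
) -> List[int]:
    """Return the minibatch index each update step would use.

    After every step one more taxonomy is stored, so the index advances by one
    and wraps around with the modulo.
    """
    picks = []
    stored = initial_taxonomies  # the initial generation already used minibatch 0
    for _ in range(num_steps):
        picks.append(stored % num_minibatches)
        stored += 1
    return picks


print(schedule(num_minibatches=3, num_steps=5))  # [1, 2, 0, 1, 2]
```

If `clusters` is not extended between calls, the same minibatch is selected every time.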



In [44]:
my_docs = docs[:30]

configurable = {"configurable": {"use_case": "classify the food reviews into categories", "batch_size": 15}}
data = {"documents":my_docs}
batches = get_minibatches(data, configurable)

# taxonomy was defined previously
rez = update_taxonomy({"documents": my_docs, "minibatches": batches["minibatches"], "clusters": [taxonomy["clusters"]]}, configurable)

rez

{'clusters': [[{'id': '1',
    'name': 'Food Taste Experience',
    'description': 'Reviews focusing on flavor, enjoyment, and sensory qualities of food products'},
   {'id': '2',
    'name': 'Product Satisfaction Evaluation',
    'description': 'Reviews expressing overall satisfaction, ease of use, and positive product perceptions'},
   {'id': '3',
    'name': 'Product Packaging Critique',
    'description': 'Reviews addressing packaging accuracy, misleading information, and presentation issues'},
   {'id': '4',
    'name': 'Product Quality Assessment',
    'description': 'Reviews evaluating product ingredients, composition, and overall quality standards'},
   {'id': '5',
    'name': 'Specific Food Type Commentary',
    'description': 'Reviews about unique or niche food products like specialty ingredients or rare items'}]]}

## Step 5 - Batch process a larger volume

Now we can run the generation and update steps on a larger sample (2% of the reviews).



In [48]:
docs = list(create_docs(df.sample(frac=0.02)))
len(docs)

11369

In [49]:
my_docs = docs

configurable = {"configurable": {"use_case": "classify the food reviews into categories", "batch_size": 15}}
data = {"documents":my_docs}
batches = get_minibatches(data, configurable)

len(batches["minibatches"])

758

In [50]:
first_taxonomy = generate_taxonomy({"documents": my_docs, 
                                    "minibatches": batches["minibatches"], "clusters": []}, configurable)



In [75]:
first_taxonomy

{'clusters': [[{'id': '1',
    'name': 'Health-Supportive Beverages',
    'description': 'Teas and drinks with specific wellness benefits like relieving throat issues or managing symptoms.'},
   {'id': '2',
    'name': 'Flavor-Focused Food Items',
    'description': 'Products valued primarily for their taste and enjoyable sensory experience.'},
   {'id': '3',
    'name': 'Convenient Snack Options',
    'description': 'Easy-to-consume food items suitable for quick eating or as part of meals.'},
   {'id': '4',
    'name': 'Product Quality Appreciation',
    'description': 'Reviews emphasizing overall product excellence, nice attributes, or pleasant characteristics.'},
   {'id': '5',
    'name': 'Price-Value Considerations',
    'description': 'Comments highlighting product cost relative to perceived quality or quantity.'}]]}

In [86]:
final_taxonomy = update_taxonomy({"documents": my_docs, 
                                  "minibatches": batches["minibatches"], 
                                  "clusters": [first_taxonomy["clusters"][0]]}, 
                                 configurable)
# Note: state["clusters"] is a list of taxonomies, so we wrap the first taxonomy,
# first_taxonomy["clusters"][0], in a list

In [87]:
final_taxonomy

{'clusters': [[{'id': '1',
    'name': 'Positive Product Endorsement',
    'description': 'Strong favorable reviews expressing enthusiasm and recommendation of the product.'},
   {'id': '2',
    'name': 'Negative Product Criticism',
    'description': 'Reviews expressing disappointment, disgust, or significant dissatisfaction with the product.'},
   {'id': '3',
    'name': 'Neutral Product Assessment',
    'description': 'Lukewarm or indifferent reviews indicating minimal emotional engagement with the product.'},
   {'id': '4',
    'name': 'Taste and Sensory Experience',
    'description': 'Reviews focusing specifically on flavor, texture, and immediate sensory perception.'},
   {'id': '5',
    'name': 'Value and Price Perception',
    'description': 'Comments highlighting economic considerations and perceived cost-quality relationship.'}]]}
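The cells above apply a single update step. To refine the taxonomy over several minibatches outside a graph, the results have to be accumulated by hand. The following is a minimal sketch, assuming an `update_fn` with the same signature and return shape as `update_taxonomy`; `run_updates` and `stub_update` are hypothetical names, and the stub only exists so that the example runs without an LLM.

```python
from typing import Callable, Dict, List


def run_updates(
    state: Dict,
    config: Dict,
    update_fn: Callable[[Dict, Dict], Dict],
    max_steps: int,
) -> Dict:
    """Call update_fn repeatedly and append each new taxonomy to state["clusters"]."""
    clusters: List[List[dict]] = list(state["clusters"])
    # Minibatch 0 was already used for the initial generation
    steps = min(max_steps, len(state["minibatches"]) - 1)
    for _ in range(max(steps, 0)):
        current = {**state, "clusters": clusters}
        result = update_fn(current, config)
        # Manual stand-in for the operator.add reducer
        clusters = clusters + result["clusters"]
    return {**state, "clusters": clusters}


def stub_update(state: Dict, config: Dict) -> Dict:
    """Stand-in for update_taxonomy: records which minibatch would be used."""
    which_mb = len(state["clusters"]) % len(state["minibatches"])
    name = f"from minibatch {which_mb}"
    return {"clusters": [[{"id": "1", "name": name, "description": ""}]]}


state = {
    "documents": [],
    "minibatches": [[0, 1], [2, 3], [4, 5]],
    "clusters": [[{"id": "1", "name": "initial", "description": ""}]],
}
final_state = run_updates(state, {"configurable": {}}, stub_update, max_steps=10)
print([taxonomy[0]["name"] for taxonomy in final_state["clusters"]])
```

Each step is one LLM call, so with 758 minibatches `max_steps` is a simple way to keep the cost bounded.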

## Step 6 - Review mode

Example of a taxonomy found:

```
{'clusters': [[{'id': '1',
    'name': 'Flavor and Taste Evaluation',
    'description': 'Reviews focusing on product taste, texture, and sensory experience'},
   {'id': '2',
    'name': 'Dietary and Nutritional Suitability',
    'description': 'Reviews about product compatibility with special diets or health requirements'},
   {'id': '3',
    'name': 'Product Performance and Satisfaction',
    'description': 'Reviews assessing overall product quality, value, and meeting consumer expectations'},
   {'id': '4',
    'name': 'Specialized Food and Beverage',
    'description': 'Reviews of unique or niche food and drink products like gourmet items'}]]}
```


In [96]:
taxonomy_review_prompt = hub.pull("wfh/tnt-llm-taxonomy-review")

taxa_review_llm_chain = (
    taxonomy_review_prompt | taxonomy_generation_llm | StrOutputParser()
).with_config(run_name="ReviewTaxonomy")


review_taxonomy_chain = taxa_review_llm_chain | parse_taxa


def review_taxonomy(
    state: TaxonomyGenerationState, config: RunnableConfig
) -> str:
    batch_size = config["configurable"].get("batch_size", 200)
    original = state["documents"]
    indices = list(range(len(original)))
    random.shuffle(indices)
    # Use the unparsed chain so the rating and suggestions stay visible;
    # the result is the raw response text, parsed later with parse_taxa
    return invoke_taxonomy_chain(
        taxa_review_llm_chain, state, config, indices[:batch_size]
    )



In [102]:
doc_subset = my_docs[:15]

configurable = {"configurable": {"use_case": "classify the food reviews into categories", "batch_size": 15}}
data = {"documents":doc_subset}
batches = get_minibatches(data, configurable)

print(batches)

reviewed_taxonomy = review_taxonomy({"documents": doc_subset, "minibatches": batches["minibatches"], "clusters": [final_taxonomy["clusters"][0]]}, 
                                 configurable)

{'minibatches': [[7, 2, 3, 13, 8, 5, 12, 6, 14, 0, 1, 10, 9, 11, 4]]}


In [103]:
print(reviewed_taxonomy)

Let me carefully evaluate the reference cluster table and provide systematic answers:

<rating_score>75</rating_score>

<explanation>
The current reference table has moderate quality for classifying food reviews:
- Strengths:
  * Categories are distinct and non-overlapping
  * No vague "miscellaneous" categories
  * Names are concise and descriptive
- Limitations:
  * Current categories are somewhat generic
  * Lacks specificity for detailed food review analysis
  * Could benefit from more nuanced categorization of review sentiments and characteristics
</explanation>

<suggestions>
1. Add more specific categories related to food reviews:
- Customer Service Experience
- Ingredient Quality Assessment
- Packaging and Presentation Feedback
- Nutritional and Health Considerations
2. Refine existing sentiment categories to capture more granular review insights
3. Ensure descriptions provide clear differentiation between categories
</suggestions>

<updated_table>
<clusters>
  <cluster>
    <i
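The review output carries more than the updated table: it also contains a `<rating_score>` and a `<suggestions>` section, which `parse_taxa` ignores. The sketch below extracts them with the same regex approach. `parse_review` is a hypothetical helper, and the sample string is a shortened version of the output above.

```python
import re
from typing import Optional


def parse_review(output_text: str) -> dict:
    """Extract the rating and the suggestions from a review response, if present."""
    score_match = re.search(r"<rating_score>\s*(\d+)\s*</rating_score>", output_text)
    suggestions_match = re.search(
        r"<suggestions>(.*?)</suggestions>", output_text, re.DOTALL
    )
    rating: Optional[int] = int(score_match.group(1)) if score_match else None
    suggestions = suggestions_match.group(1).strip() if suggestions_match else ""
    return {"rating": rating, "suggestions": suggestions}


sample = """<rating_score>75</rating_score>
<suggestions>
1. Add more specific categories related to food reviews
</suggestions>"""

review = parse_review(sample)
print(review["rating"])
print(review["suggestions"])
```

The rating could serve as a stopping signal, for example by running another update round when it falls below a chosen threshold.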

## Step 7 - Label our data

Now that we have our taxonomy, we can label every document with an LLM.


In [104]:
labeling_prompt = hub.pull("wfh/tnt-llm-classify")

labeling_llm = ChatAnthropic(model=MODEL_NAME, max_tokens_to_sample=2000)
labeling_llm_chain = (labeling_prompt | labeling_llm | StrOutputParser()).with_config(
    run_name="ClassifyDocs"
)


def parse_labels(output_text: str) -> Dict:
    """Parse the generated labels from the predictions."""
    category_matches = re.findall(
        r"\s*<category>(.*?)</category>.*",
        output_text,
        re.DOTALL,
    )
    categories = [{"category": category.strip()} for category in category_matches]
    if len(categories) > 1:
        logger.warning(f"Multiple selected categories: {categories}")
    label = categories[0]
    stripped = re.sub(r"^\d+\.\s*", "", label["category"]).strip()
    return {"category": stripped}


labeling_chain = labeling_llm_chain | parse_labels
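`parse_labels` can be checked offline as well. The sketch below uses a copy of the function without the logging call and runs it on hand-written responses: one where the model prefixed the category with its id, one with stray whitespace, and one with no `<category>` tag at all.

```python
import re


def parse_labels(output_text: str) -> dict:
    """Copy of the notebook's parser, without the logging call."""
    matches = re.findall(r"\s*<category>(.*?)</category>.*", output_text, re.DOTALL)
    label = matches[0]  # raises IndexError when no <category> tag is present
    return {"category": re.sub(r"^\d+\.\s*", "", label.strip()).strip()}


# Hand-written responses, not real model output
print(parse_labels("<category>2. Price and Value Evaluation</category>"))
print(parse_labels("<category>\n Sensory and Flavor Experience \n</category>"))

try:
    parse_labels("I could not decide.")
except IndexError:
    print("no category tag found")
```

Since a response without the tag raises an `IndexError`, callers that label documents in bulk should be prepared to receive that error for individual documents.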



In [110]:
xml_taxonomy = format_taxonomy(parse_taxa(reviewed_taxonomy)["clusters"])
results = labeling_chain.batch(
    [
        {
            "content": doc["content"],
            "taxonomy": xml_taxonomy,
        }
        for doc in docs[:10]
    ],
    {"max_concurrency": 5},
    return_exceptions=True,
)
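# Update the docs to include the categories.
# With return_exceptions=True a failed call yields an Exception instead of a dict;
# those docs get a category of None.
updated_docs = [
    {**doc, **(category if isinstance(category, dict) else {"category": None})}
    for doc, category in zip(docs[:10], results)
]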

In [113]:
pd.DataFrame(updated_docs).head(50)

Unnamed: 0,id,summary,content,category
0,487213,Great Northern Popocorn,This popcorn is delicious!!! It was very easy ...,Balanced Product Assessment
1,248058,drink it now,This chai is delicious. Everything I wanted an...,Sensory and Flavor Experience
2,306452,Great!,Our Maltese puppy loves this food. We had a ha...,Enthusiastic Product Recommendation
3,105157,Great product.,I couldn't find this anywhere in stores. My d...,Enthusiastic Product Recommendation
4,364953,awful,I bought this product because I thought it cou...,Critical Product Dissatisfaction
5,347646,Buttery oil,The popcorn popped up light but lacked a stron...,Sensory and Flavor Experience
6,551399,Love this YUMMY blue drink,I discovered Liquid Ice in Virginia at a bar. ...,Enthusiastic Product Recommendation
7,285368,Not what it was advertised as,This was advertised as sugar free. It definit...,Critical Product Dissatisfaction
8,542517,Disgusting,This reminds me of dirty socks. Its a pretty ...,Critical Product Dissatisfaction
9,399954,OK,At about two bucks a can for a pack of 12 (one...,Price and Value Evaluation
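A quick way to see how the labels are distributed is to count them. The sketch below uses only the standard library, with the ids and categories copied from the ten labeled rows shown above.

```python
from collections import Counter

# Ids and categories copied from the ten labeled rows shown above
labeled = [
    {"id": 487213, "category": "Balanced Product Assessment"},
    {"id": 248058, "category": "Sensory and Flavor Experience"},
    {"id": 306452, "category": "Enthusiastic Product Recommendation"},
    {"id": 105157, "category": "Enthusiastic Product Recommendation"},
    {"id": 364953, "category": "Critical Product Dissatisfaction"},
    {"id": 347646, "category": "Sensory and Flavor Experience"},
    {"id": 551399, "category": "Enthusiastic Product Recommendation"},
    {"id": 285368, "category": "Critical Product Dissatisfaction"},
    {"id": 542517, "category": "Critical Product Dissatisfaction"},
    {"id": 399954, "category": "Price and Value Evaluation"},
]

counts = Counter(doc["category"] for doc in labeled)
for category, n in counts.most_common():
    print(f"{n:>2}  {category}")
```

On the full DataFrame, `pd.DataFrame(updated_docs)["category"].value_counts()` gives the same kind of summary.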
