# From Wow-Effect to Production: Building Reliable LLM Systems

## Introduction

The most famous LLM applications are the ones I like to call "wow-effect LLMs." There are plenty of viral LinkedIn posts about them, and they all sound like this:

"I built [x] that does [y] in [z] minutes using AI." 

Where: 

- **[x]** is usually something like a web app/platform
- **[y]** is a somewhat impressive main feature of [x]
- **[z]** is usually an integer between 5 and 10
- **"AI"** is, most of the time, an LLM wrapper (Cursor, Codex, or similar)

Look closely and you will see that the focus of the sentence is not the quality of the result but the amount of time saved. In other words, people are not excited about how well the LLM solves the problem; they are thrilled that it quickly spits out something that sounds like a solution.

This is why I refer to them as **wow-effect LLMs**. As impressive as they sound and look, wow-effect LLMs have several issues that keep them out of a production environment. Here are a few:

- **The prompt is usually not optimized**: you don't have time to test all the different versions of the prompts, evaluate them, and provide examples in 5-10 minutes.

- **They are not meant to be sustainable**: in that short a time, you can build a nice-looking plug-and-play wrapper, but you are throwing cost, latency, maintainability, and privacy considerations out of the window.

- **They usually lack context**: LLMs are most powerful when they are plugged into a larger infrastructure, can decide which tools to use, and have contextual data to augment their answers. There is no chance of implementing that in 10 minutes.

Now, don't get me wrong: LLMs are designed to be intuitive and easy to use, so taking one from the wow effect to production level is not rocket science. It does, however, require a specific methodology.

**The goal of this blog post is to provide this methodology.**

The points we will cover to move from wow-effect LLMs to production-level LLMs are the following:

1. **LLM System Requirements** - When this beast goes into production, we need to know how to maintain it. This is done in stage zero, through adequate system requirements analysis.  

2. **Prompt Engineering** - We are going to optimize the prompt structure and provide some best-practice prompt strategies.

3. **Force structure with schemas and structured output** - We are going to move from free text to structured objects, so the format of your response is fixed and reliable.

4. **Use tools so the LLM does not work in isolation** - We are going to let the model connect to data and call functions. This provides richer answers and reduces hallucinations.

5. **Add guardrails and validation around the model** - Check inputs and outputs, enforce business rules, and define what happens when the model fails or goes out of bounds.

6. **Combine everything into a simple, testable pipeline** - Orchestrate prompts, tools, structured outputs, and guardrails into a single flow that you can log, monitor, and improve over time.

We are going to use a very simple case: **we are going to make an LLM grade data science tests**. This is just a concrete case to avoid a totally abstract and confusing article. The procedure is general enough to be adapted to other LLM applications, typically with very minor adjustments.

Looks like we've got a lot of ground to cover. Let's get started!


## Tough Choices: Cost, Latency, Privacy

Before writing any code, there are a few important questions to ask:

**How complex is your task?**  
Do you really need the latest and most expensive model, or can you use a smaller model or one from an older family?

**How often do you run this, and at what latency?**  
Is this a web app that must respond on demand, or a batch job that runs once and stores results? Do users expect an immediate answer, or is "we'll email you later" acceptable?

**What is your budget?**  
You should have a rough idea of what is "ok to spend". Is it 1k, 10k, 100k? And compared to that, would it make sense to train and host your own model, or is that clearly overkill?

**What are your privacy constraints?**  
Is it ok to send this data through an external API? Is the LLM seeing sensitive data? Has this been approved by whoever owns legal and compliance?

For simple tasks, where you have a low budget and need low latency, the smaller models (for example the 4.x mini family or 5 nano) are usually your best bet. They are optimized for speed and price, and for many basic use cases like classification, tagging, light transformations, or simple assistants, you will barely notice the quality difference while paying a fraction of the cost.

For more complex tasks, such as complex code generation, long-context analysis, or high-stakes evaluations, it can be worth using a stronger model in the 5.x family, even at a higher per-token cost. In those cases, you are explicitly trading money and latency for better decision quality.

If you are running large offline workloads, for example re-scoring or re-evaluating thousands of items overnight, batch endpoints can significantly reduce costs compared to real-time calls. This often changes which model fits your budget, because you can afford a "bigger" model when latency is not a constraint.
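To make the budget question concrete, here is a back-of-the-envelope cost estimate. It is only a sketch: the `Pricing` class, the `estimate_cost` helper, the per-token prices, and the 50% batch discount are all placeholders I made up for illustration, so plug in your provider's actual rates before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per 1M tokens (placeholders, not real rates)."""
    input_per_million: float
    output_per_million: float


def estimate_cost(n_requests: int, input_tokens: int, output_tokens: int,
                  pricing: Pricing, discount: float = 0.0) -> float:
    """Estimated spend in USD for n_requests calls of a fixed average size.

    discount is the fraction taken off the bill, e.g. by a batch endpoint.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    per_request = (input_tokens * pricing.input_per_million
                   + output_tokens * pricing.output_per_million) / 1_000_000
    return n_requests * per_request * (1.0 - discount)


# Placeholder numbers, NOT real prices.
small = Pricing(input_per_million=0.50, output_per_million=2.00)
large = Pricing(input_per_million=5.00, output_per_million=20.00)

# Assumed workload: 10,000 submissions, ~2,000 input and ~500 output tokens each.
realtime_small = estimate_cost(10_000, 2_000, 500, small)
realtime_large = estimate_cost(10_000, 2_000, 500, large)
batch_large = estimate_cost(10_000, 2_000, 500, large, discount=0.5)  # assumed discount

print(f"small model, real-time: ${realtime_small:,.2f}")
print(f"large model, real-time: ${realtime_large:,.2f}")
print(f"large model, batch:     ${batch_large:,.2f}")
```

Even with made-up numbers, the exercise is useful: it forces you to write down request volume and average token counts, which are usually the inputs people forget to estimate.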

From a privacy standpoint, it is also good practice to only send non-sensitive or "sensitive-cleared" data to your provider, meaning data that has been cleaned to remove anything confidential or personal. If you need even more control, you can consider running local LLMs.
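As a rough idea of what "sensitive-cleared" can mean for our grading case, the sketch below replaces the student's identity with a stable pseudonym and masks email addresses in the answer text before anything leaves your machine. The function names (`pseudonym`, `clear_submission`), the regex, and the secret are assumptions for illustration; a regex like this is nowhere near complete PII removal, so treat it as a starting point and not as a compliance tool.

```python
import hashlib
import hmac
import re

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonym(student_id: str, secret: bytes) -> str:
    """Map a student identifier to a stable, non-reversible label."""
    digest = hmac.new(secret, student_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"student_{digest[:12]}"


def clear_submission(student_id: str, answer_text: str, secret: bytes) -> dict:
    """Return a copy of the submission that is safer to send to an external API."""
    return {
        "student": pseudonym(student_id, secret),
        "answer": EMAIL_RE.sub("[EMAIL]", answer_text),
    }


SECRET = b"replace-with-a-real-secret"  # placeholder; keep the real one out of source control
cleared = clear_submission(
    student_id="jane.doe@university.example",
    answer_text="Total revenue is $6,500. Questions? jane.doe@university.example",
    secret=SECRET,
)
print(cleared)
```

Because the pseudonym is keyed with a secret that stays on your side, you can map grades back to students after the call, while the provider only ever sees an opaque label.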


## An LLM Teacher: The Grading Task

For this article, we're building an **automated grading system for Data Science exams**. Students take a test that requires them to analyze actual datasets and answer questions based on their findings. The LLM's job is to grade these submissions by:

1. Understanding what each question asks
2. Accessing the correct answers and grading criteria
3. Verifying student calculations against the actual data
4. Providing detailed feedback on what went wrong

This is a perfect example of why LLMs need tools and context. Without access to the datasets and grading rubrics, the LLM cannot grade accurately. It needs to retrieve the actual data to verify whether a student's answer is correct.

### The Test Structure

Our exam is stored in `data/test.json` and contains 10 questions across three sections. Students must analyze three different datasets: e-commerce sales, customer demographics, and A/B test results. Let's look at a few example questions:


In [1]:
import json

# Load the test file
with open('data/test.json', 'r') as f:
    test_data = json.load(f)

# Display a few example questions
print("📝 EXAMPLE QUESTIONS FROM THE EXAM\n")
print("="*70)

for section in test_data['sections'][:2]:  # Show first 2 sections
    print(f"\n🔹 Section {section['section']}: {section['title']}")
    print(f"   Dataset: {section['dataset']}")
    print()
    
    # Show first question from each section
    question = section['questions'][0]
    print(f"   Question {question['question_number']} ({question['points']} points):")
    print(f"   {question['question']}")
    print()

print("="*70)


📝 EXAMPLE QUESTIONS FROM THE EXAM


🔹 Section A: E-COMMERCE ANALYSIS
   Dataset: ecommerce_sales.csv

   Question 1 (10 points):
   What is the total revenue generated from the "Electronics" category in Q4 2024? Show your calculation.


🔹 Section B: CUSTOMER SEGMENTATION
   Dataset: customer_data.csv

   Question 5 (10 points):
   How many customers fall into each age group? Young (18-30), Middle (31-50), Senior (51+). Provide exact counts for each segment.



### The LLM's Tools: What It Needs to Access

Here's the critical insight: **the LLM cannot grade these questions from memory alone**. It needs access to:

1. **The Datasets** (`data/datasets/`) - Three CSV files containing the actual data:
   - `ecommerce_sales.csv` - 30 sales transactions with product categories, prices, and statuses
   - `customer_data.csv` - 16 customer records with age, spending, and order history
   - `ab_test_results.csv` - 40 users split between control and treatment groups

2. **Grading Rubric** (`data/class_resources/grading_rubric.json`) - Defines how to grade each question:
   - Point allocation (correct answer, showing work, interpretation)
   - Partial credit criteria
   - Common student errors to check for

3. **Ground Truth Answers** (`data/class_resources/ground_truth_answers.json`) - Contains:
   - Correct answers for each question
   - Step-by-step calculations
   - Detailed methodology explanations

Without these tools, the LLM would just be guessing. It doesn't know what's in `ecommerce_sales.csv` or what the correct revenue for Q4 2024 actually is. **This is where tools and RAG (Retrieval-Augmented Generation) become essential.**
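One simple way to expose the rubric and the ground truth to the model is a small lookup function that a tool call can hit. The sketch below is an assumption on my part, not the final implementation: `load_resource` and the `RESOURCES` whitelist are hypothetical names, and the only things taken from the setup above are the two file paths. The whitelist is there so the model can only ask for files we explicitly allow, never an arbitrary path.

```python
import json
from pathlib import Path

# File paths from the setup above; the whitelist itself is an assumption of this sketch.
RESOURCES = {
    "grading_rubric": "class_resources/grading_rubric.json",
    "ground_truth_answers": "class_resources/ground_truth_answers.json",
}


def load_resource(name: str, base_dir: str = "data"):
    """Return the parsed JSON for a whitelisted resource name."""
    if name not in RESOURCES:
        raise KeyError(f"Unknown resource {name!r}; expected one of {sorted(RESOURCES)}")
    path = Path(base_dir) / RESOURCES[name]
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


# Example usage (requires the files to exist on disk):
# rubric = load_resource("grading_rubric")
# answers = load_resource("ground_truth_answers")
```

The function makes no assumption about what is inside the JSON files; it only controls which files can be read.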

Let's look at what the actual data looks like:


In [None]:
import pandas as pd

# Load and display sample data from each dataset
print("📊 DATASET PREVIEWS\n")
print("="*70)

# E-commerce Sales
print("\n🛒 E-commerce Sales (first 5 rows):")
sales_df = pd.read_csv('data/datasets/ecommerce_sales.csv')
print(sales_df.head())
print(f"   Total records: {len(sales_df)}")

# Customer Data
print("\n\n👥 Customer Data (first 5 rows):")
customer_df = pd.read_csv('data/datasets/customer_data.csv')
print(customer_df.head())
print(f"   Total records: {len(customer_df)}")

# A/B Test Results
print("\n\n🧪 A/B Test Results (first 5 rows):")
ab_test_df = pd.read_csv('data/datasets/ab_test_results.csv')
print(ab_test_df.head())
print(f"   Total records: {len(ab_test_df)}")

print("\n" + "="*70)


Now we have our complete setup:

- **10 questions** requiring real data analysis
- **3 datasets** with actual numbers the LLM must verify
- **Grading rubric** defining how to score each answer
- **Ground truth answers** with step-by-step solutions

When a student submits "The total Electronics revenue in Q4 2024 is $6,500", the LLM can't just say "looks good!" or "seems wrong." It must:

1. Access `ecommerce_sales.csv`
2. Filter for the Electronics category and Q4 2024 dates
3. Calculate the actual total ($7,398.53)
4. Compare against the student's answer
5. Explain specifically which orders the student missed
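The deterministic part of these steps (filter, sum, compare) can be sketched in a few lines. Everything specific here is an assumption: the toy rows are invented, the column names (`order_id`, `order_date`, `category`, `revenue`) may not match the real `ecommerce_sales.csv`, and `category_revenue` and `check_answer` are hypothetical helpers. I use only the standard library so the snippet runs anywhere; the same filter is a short expression in pandas.

```python
import csv
import io
from datetime import date

# Invented toy rows for illustration; the real dataset's schema may differ.
TOY_CSV = """order_id,order_date,category,revenue
1,2024-09-30,Electronics,100.00
2,2024-10-05,Electronics,250.50
3,2024-11-17,Clothing,80.00
4,2024-12-31,Electronics,149.50
"""


def category_revenue(rows, category, start, end):
    """Sum revenue for one category with order_date in [start, end].

    Returns the total and the IDs of the orders that were counted.
    """
    total, matched_ids = 0.0, []
    for row in rows:
        order_date = date.fromisoformat(row["order_date"])
        if row["category"] == category and start <= order_date <= end:
            total += float(row["revenue"])
            matched_ids.append(row["order_id"])
    return total, matched_ids


def check_answer(student_value, actual_value, tolerance=0.01):
    """Return (is_correct, difference) using an absolute tolerance in dollars."""
    difference = student_value - actual_value
    return abs(difference) <= tolerance, difference


rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
actual, order_ids = category_revenue(
    rows, "Electronics", date(2024, 10, 1), date(2024, 12, 31)
)
is_correct, diff = check_answer(350.00, actual)
print(f"actual={actual:.2f} orders={order_ids} correct={is_correct} diff={diff:+.2f}")
```

Returning the matched order IDs alongside the total is what makes the last step possible: the LLM gets the exact list of orders that should have been counted and can explain which ones the student missed, without doing the arithmetic itself.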

This is the difference between a "wow-effect" LLM and a production-ready one. Let's build it step by step.
