The dominant paradigm in AI treats intelligence as abstract computation: take in information, process it through learned functions, produce outputs. Bodies are incidental—useful for gathering data, maybe, but not constitutive of intelligence itself. This post argues that this framing misses something important, and that understanding why could reshape how we build and evaluate AI systems.



## The Sensorimotor Grounding Hypothesis

The core claim of embodied cognition is that meaning is grounded in action. When you understand the word "grasp," you don't just retrieve a dictionary definition—you activate the same sensorimotor circuits involved in actually grasping. Concepts aren't abstract symbols floating in a void; they're patterns of potential interaction with the world.

This isn't mysticism. It's an empirical claim with testable predictions. If meaning is grounded in action, then:

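- Processing action words should activate motor cortex (neuroimaging studies have reported this)
- Interfering with motor systems should slow language comprehension (behavioral and brain-stimulation studies have reported such effects, though their size and interpretation are still debated)
- Abstract concepts should be structured through bodily metaphors (they often are: we "grasp" ideas, "weigh" options, "move forward" with plans)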

The implications for AI are immediate. If human intelligence is fundamentally structured by embodiment, then systems trained only on text might develop a qualitatively different kind of "understanding"—one that correlates with human concepts but doesn't share their grounding.



## Moravec's Paradox Revisited

In 1988, Hans Moravec observed something puzzling: tasks that seem hard for humans (chess, calculus) are easy for computers, while tasks that seem trivial (walking, recognizing objects, catching a ball) are extraordinarily difficult to automate.

The standard explanation is that evolution had hundreds of millions of years to optimize sensorimotor control, while abstract reasoning is a recent and still-imperfect add-on. But there's a deeper point: sensorimotor control is computationally harder than it looks because the real world is high-dimensional, noisy, and unforgiving.

Consider walking. You're coordinating many of the body's roughly 600 skeletal muscles through a system with feedback delays on the order of 50 to 200 ms, balancing a top-heavy mass on two small contact points, while dealing with uneven terrain, unexpected obstacles, and perturbations. The state space is enormous. The dynamics are nonlinear. And failure (falling) has immediate, painful consequences.
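To make the feedback-delay point concrete, here is a minimal sketch of a far simpler balancing problem: a linearized inverted pendulum held up by a PD controller that only sees the state as it was some time ago. Everything specific in it is an assumption chosen for illustration (the gains `kp` and `kd`, the 1 m pendulum, the 1 rad "fallen" threshold), and it is nowhere near a model of human gait.

```python
from collections import deque


def stays_upright(delay_s, kp=30.0, kd=8.0, g_over_l=9.81,
                  theta0=0.05, dt=0.001, duration_s=10.0, fall_angle=1.0):
    """Simulate a linearized inverted pendulum under delayed PD feedback.

    Returns True if the pendulum never tips past fall_angle (radians).
    All parameter values are illustrative assumptions.
    """
    theta, omega = theta0, 0.0
    delay_steps = int(round(delay_s / dt))
    # Buffer of past (angle, angular velocity) pairs. The controller only
    # ever reads the oldest entry, i.e. the state delay_s seconds ago.
    history = deque([(theta, omega)] * (delay_steps + 1),
                    maxlen=delay_steps + 1)
    for _ in range(int(duration_s / dt)):
        seen_theta, seen_omega = history[0]
        torque = -kp * seen_theta - kd * seen_omega  # acts on stale data
        alpha = g_over_l * theta + torque            # gravity destabilizes
        omega += alpha * dt                          # semi-implicit Euler step
        theta += omega * dt
        history.append((theta, omega))
        if abs(theta) > fall_angle:
            return False
    return True


for delay in (0.0, 0.05, 0.2):
    print(f"delay {delay * 1000:.0f} ms -> upright: {stays_upright(delay)}")
```

With these particular gains, the same controller that balances comfortably at a 50 ms delay tips over at 200 ms. Delay alone, with no noise and only one joint, is enough to turn a working controller into a failing one.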

Chess, by contrast, is turn-based, fully observable, discrete, deterministic, and consequence-free. The only reason it seemed hard to us is that our evolved hardware isn't optimized for tree search. Once we build hardware that is, the problem evaporates.
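Those properties are exactly what make exhaustive tree search possible. As a sketch, here is the idea on a game small enough to solve completely: a Nim variant I'm assuming for illustration, where players alternately take 1, 2, or 3 stones and whoever takes the last stone wins. Chess engines add pruning and evaluation heuristics on top, but the skeleton is the same recursion.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def player_to_move_wins(stones: int) -> bool:
    """Exhaustive game-tree search: can the player to move force a win?"""
    # A move takes 1, 2, or 3 stones; whoever takes the last stone wins.
    # With 0 stones left there is no legal move: the previous player won.
    for take in (1, 2, 3):
        if take <= stones and not player_to_move_wins(stones - take):
            return True  # found a move that leaves the opponent losing
    return False


losing_positions = [n for n in range(1, 13) if not player_to_move_wins(n)]
print(losing_positions)  # [4, 8, 12]
```

Nothing here has to cope with sensor noise, timing, or partial observability. The whole "world" is one integer, and the rules never surprise you.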

Moravec's paradox suggests that embodied intelligence isn't just one variety of intelligence—it's the hard case. Abstract reasoning might be a simplified special case that happens to be useful in certain domains.



## Active Inference and Predictive Processing

One of the most compelling frameworks for embodied cognition comes from predictive processing. The idea: brains are fundamentally prediction machines. Perception isn't passive reception of sensory data; it's an active process of inference in which the brain generates predictions and uses sensory input to correct them. Active inference, the framework's extension to behavior, applies the same logic to action.

In this view, action and perception are two sides of the same coin:

- **Perception** updates your model of the world to match sensory input
- **Action** changes the world to match your predictions

Both serve the same goal: minimizing prediction error (or, in the more general formulation found in the technical literature, variational "free energy").

This dissolves the traditional boundary between sensing and acting. An organism that just sits and models the world will accumulate prediction error. To minimize error, you must act—not just to gather information, but to bring the world into alignment with your expectations. "I predict I will be holding a coffee cup" becomes true when you pick one up.
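The two routes can be sketched with a deliberately tiny toy. This is my own simplification, not Friston's formalism: the "world" and the "belief" are each a single number, the error is their absolute difference, and the update rate of 0.5 is arbitrary.

```python
def prediction_error(belief, world):
    return abs(world - belief)


def perceive(belief, world, rate=0.5):
    """Perception: move the belief toward what is sensed."""
    return belief + rate * (world - belief)


def act(belief, world, rate=0.5):
    """Action: move the world toward what is predicted."""
    return world + rate * (belief - world)


def run(update, steps=10):
    # 1.0 = "I am holding a coffee cup"; the hand starts empty (0.0).
    belief, world = 1.0, 0.0
    for _ in range(steps):
        belief, world = update(belief, world)
    return belief, world


by_perception = run(lambda b, w: (perceive(b, w), w))  # world left untouched
by_action = run(lambda b, w: (b, act(b, w)))           # belief left untouched

print("perception only:", by_perception, prediction_error(*by_perception))
print("action only:    ", by_action, prediction_error(*by_action))
```

Both runs drive the error toward zero, but they end in different places. Perception alone gives up the prediction (the belief decays toward "no cup"), while action alone makes the prediction come true (the world moves toward "cup in hand").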

Karl Friston's formalization of this idea has become influential in neuroscience and is starting to influence robotics and AI. The core insight for our purposes: you can't separate the "intelligence" part from the "body" part. The whole system is optimizing prediction, and action is essential to that optimization.
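For reference, the quantity usually called variational free energy can be written as follows. The notation is an assumption of this sketch rather than something defined elsewhere in this post: $o$ for observations, $s$ for hidden states of the world, $p$ for the organism's generative model, and $q$ for its approximate beliefs about $s$.

$$
F = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
  = D_{\mathrm{KL}}\big[\,q(s) \,\|\, p(s \mid o)\,\big] - \ln p(o)
$$

Because the KL term is never negative, $F$ is an upper bound on surprise, $-\ln p(o)$. Roughly speaking, perception lowers $F$ by adjusting $q$, while action lowers it by changing which observations $o$ get sampled.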



## Brooks and Intelligence Without Representation

Rodney Brooks' subsumption architecture from the 1980s was a direct attack on classical AI. Classical AI builds explicit representations of the world, reasons over them, and then acts. Brooks argued this was backwards: you should build layers of sensorimotor competence that directly couple perception to action, with higher layers modulating lower ones rather than replacing them.

His robots didn't have world models in the traditional sense. They had reactive behaviors that, when combined, produced surprisingly competent navigation and exploration. "The world is its own best model," Brooks argued—why build an expensive internal representation when you can query the real thing for free?
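Here is a loose sketch of the layering idea in code. It is not Brooks' actual design, which wired together networks of simple finite-state machines with suppression and inhibition links; I'm reducing that to a single rule where the highest active layer wins. The behaviors, sensor names, and thresholds are all hypothetical.

```python
from typing import Callable, Optional

Sensors = dict  # hypothetical raw readings, e.g. {"front_distance_m": 0.8}
Behavior = Callable[[Sensors], Optional[str]]


def wander(sensors: Sensors) -> Optional[str]:
    """Layer 0: always has an opinion, just keep moving."""
    return "forward"


def avoid(sensors: Sensors) -> Optional[str]:
    """Layer 1: speaks up only when something is close ahead."""
    return "turn_left" if sensors["front_distance_m"] < 0.3 else None


def escape(sensors: Sensors) -> Optional[str]:
    """Layer 2: speaks up only on physical contact."""
    return "reverse" if sensors["bumper_pressed"] else None


LAYERS: list[Behavior] = [wander, avoid, escape]  # lowest to highest


def control_step(sensors: Sensors, layers=LAYERS) -> Optional[str]:
    """Return the command of the highest layer that proposes one."""
    command = None
    for behavior in layers:
        proposal = behavior(sensors)
        if proposal is not None:
            command = proposal  # a higher layer suppresses lower output
    return command


print(control_step({"front_distance_m": 1.0, "bumper_pressed": False}))
print(control_step({"front_distance_m": 0.1, "bumper_pressed": False}))
print(control_step({"front_distance_m": 0.1, "bumper_pressed": True}))
```

Note what is absent: no map, no planner, no stored state at all. Each layer reads the sensors directly on every step, which is the "query the real thing" idea in miniature.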

The subsumption architecture fell out of favor as machine learning enabled powerful learned representations. But Brooks' core critique remains relevant:

1. **Representations are expensive** in energy, time, and complexity
2. **The world provides information for free** if you're set up to exploit it
3. **Tight coupling between perception and action** can produce robust behavior without elaborate reasoning

Modern robotics has partially rediscovered these ideas through end-to-end learning, where policies map directly from sensors to actions without explicit intermediate representations. The representation is there, but it's implicit and learned, not hand-designed.



## What Might LLMs Be Missing?

If embodied cognition is right, what does that mean for language models trained only on text?

The strong claim would be: LLMs can never truly understand language because they lack sensorimotor grounding. They manipulate symbols that refer to concepts they've never experienced. This is essentially Searle's Chinese Room argument updated for modern AI.

The weak claim: LLMs develop a different kind of understanding—one that captures statistical regularities in how words are used, which correlates strongly with grounded meaning but isn't identical to it. This might explain certain systematic failures (spatial reasoning, physical intuition) while allowing for genuine competence in many domains.

There's also a deflationary view: maybe grounding doesn't matter as much as philosophers think. If statistical patterns in text capture enough of the structure of grounded concepts, then text-only training might be sufficient for most purposes. The correlation between "how words are used" and "what words mean" is so strong that you can get meaning for free by modeling usage.
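A toy version of "meaning from usage" makes the deflationary view easier to picture. The sketch below counts which words share a sentence in a six-sentence corpus I made up for illustration, then compares words by the cosine similarity of those counts. Real language models are vastly more sophisticated; this only shows the basic move.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy corpus, invented for this example.
corpus = [
    "she can grasp the cup",
    "she can hold the cup",
    "he can grasp the idea",
    "he can hold the idea",
    "they weigh the options",
    "they weigh the fruit",
]


def context_vectors(sentences):
    """Map each word to counts of the other words it shares a sentence with."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            vectors[word].update(words[:i] + words[i + 1:])
    return vectors


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


vectors = context_vectors(corpus)
grasp_hold = cosine(vectors["grasp"], vectors["hold"])
grasp_weigh = cosine(vectors["grasp"], vectors["weigh"])
print(f"grasp~hold: {grasp_hold:.2f}, grasp~weigh: {grasp_weigh:.2f}")
```

The program rates "grasp" and "hold" as near-synonyms without ever having touched a cup. Whether that counts as getting meaning for free, or only a shadow of it, is the question the strong and weak claims disagree on.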

I don't think this debate is settled. But noticing it changes how you evaluate AI systems. "Does it produce correct outputs?" is different from "Does it understand in the way we do?" Both questions matter, depending on what you're building.



## The Enactivist Extension

Enactivism pushes embodied cognition further: cognition isn't just grounded in the body—it's constituted by the entire organism-environment system. Intelligence doesn't reside "in the head"; it emerges from the dynamic coupling between brain, body, and world.

This sounds abstract, but it has concrete implications:

- **Offloading**: Experts routinely offload cognitive work to external structures (notes, tools, social systems). The intelligence is in the coupled system, not just the brain.
- **Stigmergy**: Social insects coordinate through environmental modifications (pheromone trails). The "algorithm" is distributed across agents and environment.
- **Affordances**: Perception is already structured for action—we see a chair as "sittable," a handle as "graspable." The world shows up as opportunities for interaction, not raw sensory data.
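The stigmergy point in particular is easy to sketch. Below is a minimal averaged model of pheromone trails, with no individual ants and no randomness, under assumptions I picked for illustration: two routes to food, one twice as long as the other, a fixed amount of traffic per round, and 10% evaporation.

```python
def simulate_trails(rounds=100, ants_per_round=10.0, evaporation=0.1):
    """Averaged pheromone model: two routes to food, one twice as long.

    Returns each route's final share of the total pheromone.
    All parameter values are illustrative assumptions.
    """
    length = {"short": 1.0, "long": 2.0}
    pheromone = {"short": 1.0, "long": 1.0}  # start with no preference
    for _ in range(rounds):
        total = sum(pheromone.values())
        for route in pheromone:
            # Ants pick a route in proportion to the pheromone already on it...
            traffic = ants_per_round * pheromone[route] / total
            # ...and a shorter route is reinforced more per round.
            deposit = traffic / length[route]
            pheromone[route] = (1 - evaporation) * pheromone[route] + deposit
    total = sum(pheromone.values())
    return {route: level / total for route, level in pheromone.items()}


shares = simulate_trails()
print(shares)
```

No agent in this model compares route lengths or stores a map. The preference for the short route exists only in the trail itself, which is what it means for the "algorithm" to be distributed across agents and environment.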

If enactivism is right, then building embodied AI isn't just about putting a computer in a robot body. It's about designing the entire system—agent, body, and environment—so that intelligence can emerge from their interaction.



## Practical Implications

What does this mean for someone building AI systems?

1. **Don't assume language is enough.** Text captures a lot, but systematic gaps in physical reasoning might require different training data or architectures.

2. **Simulation has limits.** If real-world interaction shapes cognition in deep ways, then sim-to-real gaps might be more fundamental than we assume.

3. **Multimodal training matters.** Grounding language in vision, action, and interaction might not just add capabilities—it might change the nature of what's learned.

4. **Evaluate for the right thing.** If you need grounded understanding, test for it. If you need statistical text competence, test for that. They might come apart.

5. **Bodies aren't optional for some applications.** Household robots, surgical assistants, autonomous vehicles—these require genuine sensorimotor intelligence, not just language competence.

The bigger point: intelligence is not a single thing. There are many kinds, optimized for different niches. Embodied intelligence is one important kind that we're only beginning to understand how to build.





```{=html}
<div style="text-align:center;">
  <img src="image.png" alt="Figure" width="65%"/>
  <p><em>Figure 1. Closed-loop sensorimotor integration vs. open-loop processing</em></p>
</div>
```

