![Gemini_Generated_Image_c9rji9c9rji9c9rj.png](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/AFZbm4pQsdBGhOZUnM7XhQ/Gemini-Generated-Image-c9rji9c9rji9c9rj.png)

# **Build a Speed Dating Match Prediction Agent with LangGraph**


Estimated time needed: **45** minutes


With Valentine's Day around the corner, love is in the air... and so is data! What makes two people click in just four minutes? Can machine learning predict romantic chemistry? In this lab, we're going to find out.

Speed dating generates some of the most fascinating datasets in social science. Imagine hundreds of people rating each other on attractiveness, intelligence, fun, ambition, and shared interests, all while hoping to find that special someone. Every conversation becomes a data point. Every match (or lack thereof) tells a story. Your mission? Build an AI agent that can predict who will match with whom before the date even ends.

But here's where it gets interesting. We're not just building a prediction model. We're building an **intelligent agent** that can think through the entire data science pipeline on its own. Give it one instruction, and watch it autonomously clean messy data, select the most predictive features, and train a sophisticated probability model.

This is what we call an **agentic workflow**. Instead of you manually writing code for each step (clean this, select that, train here), you'll create an AI assistant that *reasons* about what needs to happen next. It's like having a data scientist colleague who can take a complex task and break it down into the right sequence of actions without you having to micromanage every detail.

We'll use **LangGraph** to orchestrate this magic. Think of it as a choreographer for your AI agent, managing the dance between thinking and doing, between reasoning about the problem and taking action to solve it.

Here's what your agent will do:

- Clean speed dating datasets that are messy with weird encodings and missing values
- Figure out which features actually matter for predicting matches (spoiler: it's not always what you think)
- Train an XGBoost model that predicts match probability, not just yes or no outcomes
- Evaluate its own performance and report back like a true data scientist

By the end of this Valentine's Day project, you'll understand how to build AI agents that can handle complex, multistep workflows. You'll see firsthand how the **ReAct pattern** (Reasoning + Acting) transforms simple tools into an intelligent system that makes decisions. And you'll have built something genuinely useful: a match prediction system that could theoretically tell you your chances before you even sit down for that speed date.

Ready to play cupid with code? Let's dive in.

---


# **Table of Contents**

1. [Objectives](#objectives)
2. [Setup](#setup)
   - [Installing Required Libraries](#installing-required-libraries)
3. [Introduction](#introduction)
   - [What Is LangGraph?](#what-is-langgraph)
   - [The ReAct Pattern](#the-react-pattern-reason--act)
   - [Why Use an Agent for ML Workflows?](#why-use-an-agent-for-ml-workflows)
4. [Understanding the Speed Dating Dataset](#understanding-the-speed-dating-dataset)
5. [Building The Agent](#building-the-agent)
   - [Step 1: Define the Tools (The "Actions")](#step-1-define-the-tools-the-actions)
   - [Step 2: Create the Agent State](#step-2-create-the-agent-state)
   - [Step 3: Build the Agent Reasoning Node](#step-3-build-the-agent-reasoning-node)
   - [Step 4: Construct the LangGraph Workflow](#step-4-construct-the-langgraph-workflow)
6. [Running the Agent](#running-the-agent)
7. [Understanding the Output](#understanding-the-output)
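8. [Exercise 1: Modify the Agent to Use Different Feature Counts](#exercise-1-modify-the-agent-to-use-different-feature-counts)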
9. [Conclusion](#conclusion)

---


# Objectives


After completing this lab, you will be able to:

- Understand the purpose and structure of a LangGraph workflow for ML pipelines
- Use the ReAct pattern to build autonomous AI agents that reason about data science tasks
- Define custom tools for data cleaning, feature selection, and model training
- Create a state dictionary to manage workflow progress
- Build and execute a LangGraph that orchestrates sequential ML operations
- Evaluate match probability predictions using ROC AUC scores


# Setup


For this lab, we will be using the following libraries:

* [`langgraph`](https://github.com/langchain-ai/langgraph): Core framework for building AI workflows
* [`langchain`](https://python.langchain.com): Base library for managing prompts and agents
* [`langchain-openai`](https://python.langchain.com/docs/integrations/openai): Access to GPT-based models
* [`pandas`](https://pandas.pydata.org/): Data manipulation and analysis
* [`scikit-learn`](https://scikit-learn.org/): Machine learning library for preprocessing and feature selection
* [`xgboost`](https://xgboost.readthedocs.io/): Gradient boosting framework for classification


In [27]:
%%capture
%pip install langchain==1.2.10
%pip install langchain-openai==1.1.9
%pip install langgraph==1.0.8
%pip install openai==2.20.0
%pip install numpy==2.4.2
%pip install pandas==3.0.0
%pip install scikit-learn==1.8.0
%pip install scipy==1.17.0
%pip install xgboost==3.2.0
%pip install tqdm==4.67.3
%pip install joblib==1.5.3
%pip install requests==2.32.5
%pip install PyYAML==6.0.3

Load the required Python libraries:


In [28]:
import pandas as pd
import numpy as np
import io
import os
from typing import Annotated, List, Dict, Any, Union
from typing_extensions import TypedDict
from operator import add

from langchain_openai import ChatOpenAI
from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage
from langchain_core.tools import tool
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, roc_auc_score
import xgboost as xgb

Fetch the speed dating dataset by running the following code.


In [5]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/sqiJ_CW9x_k2T6C2KgPf6Q/speeddating.csv
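
On systems where `wget` is not available (a default Windows shell, for example), the same file can be fetched from Python instead. The sketch below uses only the standard library; the helper name `download_if_missing` is an illustrative choice and not part of the lab.

```python
import os
import urllib.request

DATA_URL = (
    "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/"
    "sqiJ_CW9x_k2T6C2KgPf6Q/speeddating.csv"
)

def download_if_missing(url: str, path: str) -> str:
    """Download `url` to `path` unless the file already exists; return `path`."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path

# Usage (needs network access):
# download_if_missing(DATA_URL, "speeddating.csv")
```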



# Introduction


Before we dive into code, let's understand the key technologies that make this project powerful: **LangGraph** and the **ReAct agent framework**.

Modern AI applications often involve multiple steps: deciding what to do, selecting the right approach, running computations, and then responding with results. LangGraph helps make that process structured by defining a flow (called a StateGraph) that connects all these steps together in a repeatable and understandable way.

In this project, we'll build a "Match Prediction Scientist" agent that can:
- Understand complex instructions (like "Clean the data, select features, and train a model")
- Decide which tool to use at each step
- Execute data science tasks sequentially
- Report performance metrics automatically


## What Is LangGraph?


LangGraph is a framework that helps us build multi-step AI workflows, which we call "graphs." Think of it as a flowchart builder specifically designed for AI agents. The framework introduces three key concepts:

1. **State**: The data that flows through your workflow, like a backpack the agent carries from one task to the next. In our case, this includes messages and intermediate results.

2. **Nodes**: Individual tasks the agent performs, such as reasoning about what to do next or executing a tool like data cleaning.

3. **Edges**: Connections that determine the order of tasks, creating the flow of your agent's decision-making process.

It's called a "graph" because in computer science, a graph is a structure of connected nodes, just like a flowchart you might draw on paper. What makes LangGraph special is that it's designed specifically for AI agents, handling the complexity of state management and allowing for sophisticated workflows with conditional logic and loops.
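
To make the three concepts concrete, here is a toy graph runner in plain Python. It is not LangGraph's API: the names `run_graph`, `nodes`, and `edges` are invented for illustration, and the two node functions only append to a log.

```python
from typing import Callable, Dict, Optional

State = Dict[str, list]

def clean(state: State) -> State:
    # A node reads the state and returns an updated copy.
    return {"log": state["log"] + ["cleaned"]}

def train(state: State) -> State:
    return {"log": state["log"] + ["trained"]}

# Nodes: name -> function. Edges: name -> next node name (None means stop).
nodes: Dict[str, Callable[[State], State]] = {"clean": clean, "train": train}
edges: Dict[str, Optional[str]] = {"clean": "train", "train": None}

def run_graph(entry: str, state: State) -> State:
    current: Optional[str] = entry
    while current is not None:
        state = nodes[current](state)  # run the node
        current = edges[current]       # follow the edge
    return state

final_state = run_graph("clean", {"log": []})
print(final_state["log"])  # ['cleaned', 'trained']
```

LangGraph adds what this toy lacks, such as conditional edges and managed state updates, but the mental model of state passed from node to node along edges is the same.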


## The ReAct Pattern (Reason + Act)


The **ReAct** pattern is a technique where the AI agent alternates between two modes:

1. **Reason**: The agent thinks about what needs to be done next
2. **Act**: The agent uses a tool to accomplish that task

This cycle repeats until the task is complete. For our match prediction pipeline, the agent will:
- **Reason**: "I need to clean the data first"
- **Act**: Call the `clean_speed_dating_data` tool
- **Reason**: "Now I should select the best features"
- **Act**: Call the `select_top_features` tool
- **Reason**: "Finally, I'll train the model"
- **Act**: Call the `train_probability_model` tool
- **Reason**: "I have all the results, time to report back"
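
The cycle can be sketched without any LLM at all. In the toy loop below, a hard-coded `scripted_reasoner` stands in for the model and the tools are stubs that return fixed strings. Only the tool names come from this lab; everything else is hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Stub tools: the real ones are defined later in the lab.
TOOLS: Dict[str, Callable[[], str]] = {
    "clean_speed_dating_data": lambda: "data cleaned",
    "select_top_features": lambda: "features selected",
    "train_probability_model": lambda: "model trained",
}

def scripted_reasoner(observations: List[str]) -> Optional[str]:
    """Stand-in for the LLM: choose the next tool from how many steps are done."""
    plan = list(TOOLS)  # dicts preserve insertion order
    return plan[len(observations)] if len(observations) < len(plan) else None

def react_loop() -> List[Tuple[str, str]]:
    observations: List[str] = []
    trace: List[Tuple[str, str]] = []
    while True:
        action = scripted_reasoner(observations)  # Reason
        if action is None:                        # nothing left to do
            break
        result = TOOLS[action]()                  # Act
        observations.append(result)               # Observe
        trace.append((action, result))
    return trace

for action, observation in react_loop():
    print(f"ACTION: {action} -> OBSERVATION: {observation}")
```

In the real agent, the reasoning step is an LLM call that sees the full message history, so the order of tool calls is decided at run time rather than scripted.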


## Why Use an Agent for ML Workflows?


Traditional ML pipelines require you to manually:
- Remember the correct order of operations
- Pass data between functions
- Handle errors at each step
- Track intermediate results

An **agentic workflow** automates this by:
- Letting the AI decide the sequence of operations
- Managing state automatically
- Providing transparency into the reasoning process
- Handling complex multistep tasks with a single instruction

---


# Understanding the Speed Dating Dataset


The speed dating dataset is a public dataset published on Kaggle by [Ulrik Thyge Pedersen](https://www.kaggle.com/datasets/ulrikthygepedersen/speed-dating/data). It contains information from real speed dating events where participants rated each other on various attributes. Each row represents one person's ratings of another person during a 4-minute date.

**Key Features in the Dataset:**
- **Demographic info**: age, race, field of study
- **Attribute ratings**: attractiveness, sincerity, intelligence, fun, ambition, shared interests (rated 1 to 10)
- **Decision variables**: `decision` (did this person want to see the other again?), `decision_o` (did the partner want to see them again?)
- **Match outcome**: `match` (1 if both said yes, 0 otherwise)

**The Challenge:**
The `decision` and `decision_o` columns cause **data leakage**: if we know both individual decisions, we automatically know whether there's a match. For a realistic probability prediction, we need to remove these columns and predict matches based only on ratings and demographics.
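
The leakage is easy to see on a handful of rows. The values below are made up for illustration and are not taken from the dataset; only the column names match.

```python
# Hypothetical rows: (decision, decision_o, match)
rows = [
    (1, 1, 1),
    (1, 0, 0),
    (0, 1, 0),
    (0, 0, 0),
]

def match_from_decisions(decision: int, decision_o: int) -> int:
    """A match happens only when both people say yes."""
    return int(decision == 1 and decision_o == 1)

# The two decision columns reproduce the target exactly, so a model that
# sees them has nothing left to predict.
reconstructed = [match_from_decisions(d, d_o) for d, d_o, _ in rows]
print(reconstructed == [m for _, _, m in rows])  # True
```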

Additionally, the dataset has:
- Byte string encodings (e.g., `b'female'` instead of `'female'`)
- Missing values that need imputation
- Categorical variables that need encoding

Our agent will handle all of these automatically!
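
For instance, the byte-string wrapper can be stripped with a regular expression that only touches values of the form `b'...'`. This is a sketch using the standard library; `strip_bytes_wrapper` is a hypothetical helper, not a function used elsewhere in this lab.

```python
import re

# Matches a whole value wrapped as b'...' and captures the inner text.
_BYTES_WRAPPER = re.compile(r"^b'(.*)'$")

def strip_bytes_wrapper(value: str) -> str:
    """Turn "b'female'" into "female"; leave other strings unchanged."""
    return _BYTES_WRAPPER.sub(r"\1", value)

print(strip_bytes_wrapper("b'female'"))  # female
print(strip_bytes_wrapper("male"))       # male
```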

----


# Building The Agent


## Step 1: Define the Tools (The "Actions")


Tools are the actions our agent can take. Each tool is a Python function decorated with `@tool` that performs a specific task. Let's define three tools for our ML pipeline:


### Tool 1: Clean Speed Dating Data


In [31]:
@tool
def clean_speed_dating_data(file_name: str) -> str:
    """Cleans speeddating.csv, handles bytes-strings, and removes leakage columns."""
    df = pd.read_csv(file_name)

    # 1. Drop leakage: 'decision' and 'decision_o' are the individual votes.
    # If we know both, the match is fully determined, so we drop them.
    # 'has_null' is dropped here as well.
    df = df.drop(columns=['has_null', 'decision', 'decision_o'], errors='ignore')

    # 2. Clean byte-strings (e.g., b'female' -> female), then label-encode.
    # Non-numeric columns are selected by exclusion so the loop works whether
    # pandas reports them as object or string dtype.
    # Note: astype(str) turns missing values into the literal category "nan".
    for col in df.select_dtypes(exclude=['number']).columns:
        df[col] = (df[col].astype(str)
                   .str.replace("b'", "", regex=False)
                   .str.replace("'", "", regex=False))
        df[col] = LabelEncoder().fit_transform(df[col])

    # 3. Impute the remaining missing values with each column's median
    imputer = SimpleImputer(strategy='median')
    df_clean = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

    df_clean.to_csv('cleaned_data.csv', index=False)
    return f"Data cleaned. Rows: {len(df_clean)}. Saved to 'cleaned_data.csv'."

**What This Tool Does:**
- Removes the data leakage columns (`decision`, `decision_o`) along with `has_null`
- Strips the byte-string wrappers from text columns and label-encodes them as integers
- Fills missing values with each column's median
- Saves the cleaned data for the next step


### Tool 2: Select Top Features


In [32]:
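@tool
def select_top_features(n_features: int) -> dict:
    """Uses Recursive Feature Elimination to find the best features for match probability."""
    df = pd.read_csv('cleaned_data.csv')
    X = df.drop(columns=['match'])
    y = df['match']

    # RFE repeatedly fits the forest and drops the weakest feature until
    # n_features remain. random_state keeps the selection reproducible.
    selector = RFE(
        RandomForestClassifier(n_estimators=50, random_state=42),
        n_features_to_select=n_features,
    )
    selector.fit(X, y)

    # support_ is a boolean mask over the columns of X
    features = X.columns[selector.support_].tolist()
    return {"selected_features": features}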

**What This Tool Does:**
- Uses Recursive Feature Elimination (RFE) to identify the most predictive features
- Trains a Random Forest to rank feature importance
- Returns the top N features that best predict matches


### Tool 3: Train Probability Model


In [33]:
@tool
def train_probability_model(features: List[str]) -> str:
    """Trains XGBoost and returns the ROC AUC (probability quality score)."""
    df = pd.read_csv('cleaned_data.csv')
    X = df[features]
    y = df['match']

    # Stratify so the train and test sets keep the same share of matches;
    # random_state makes the split reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    model = xgb.XGBClassifier(n_estimators=100, eval_metric='logloss', random_state=42)
    model.fit(X_train, y_train)

    # Predicted probability of the positive class (match = 1)
    y_prob = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_prob)

    # Note: the closing sentence is fixed text, not a judgment derived from the AUC.
    return f"Model trained on {features}. ROC AUC Score: {auc:.4f}. The probability predictions are reliable."

**What This Tool Does:**
- Trains an XGBoost classifier on the selected features
- Predicts match probabilities (not just yes/no classifications)
- Evaluates using ROC AUC, which measures how well the model ranks matches above non-matches (0.5 is chance level, 1.0 is a perfect ranking)
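
To see what "ranking" means here, ROC AUC can be computed by hand as the fraction of (match, non-match) pairs in which the match receives the higher score. The sketch below is a plain-Python illustration with made-up labels and scores. The function name `roc_auc_by_pairs` is hypothetical, and the lab itself uses scikit-learn's `roc_auc_score`.

```python
from typing import Sequence

def roc_auc_by_pairs(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Fraction of (match, non-match) pairs where the match scores higher.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one match and one non-match")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities, for illustration only.
print(roc_auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Three of the four pairs are ordered correctly (only the match scored 0.35 falls below the non-match scored 0.4), hence 0.75. The metric depends only on the ordering of the scores, not on how well calibrated the probabilities are.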


---


## Step 2: Create the Agent State


The state is the "memory" that flows through the workflow. For our agent, we need to track the conversation messages.


In [34]:
class AgentState(TypedDict):
    messages: Annotated[List[BaseMessage], add]

**Why Use `Annotated[List[BaseMessage], add]`?**
- `add` (from Python's `operator` module) acts as a reducer: LangGraph concatenates the messages a node returns onto the existing list rather than replacing it
- This creates a full conversation history that the agent can reference
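
The effect of the reducer can be reproduced with plain lists. In this sketch, strings stand in for message objects, and the update step is written out by hand to mirror how the old value and a node's return value are combined.

```python
from operator import add

history = ["human: clean the data"]
update = ["ai: calling clean_speed_dating_data"]  # what a node would return

# With a reducer, the new value is reducer(old_value, update).
history = add(history, update)  # same as history + update
print(history)
```

Without a reducer, the second list would simply overwrite the first and the original request would be lost.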
  
---


## Step 3: Build the Agent Reasoning Node


The reasoning node is where the agent thinks about what to do next.


In [35]:
# --- The reasoning node: the LLM decides which tool to call next ---

tools = [clean_speed_dating_data, select_top_features, train_probability_model]

# Read the API key from the environment, then give the model access to the tools.
openai_api_key = os.environ.get('OPENAI_API_KEY')
llm = ChatOpenAI(model="gpt-4o-mini",
                 api_key=openai_api_key,
                 temperature=0).bind_tools(tools)

def agent_reasoning(state: AgentState):
    system_prompt = SystemMessage(content=(
        "You are a Match Prediction Expert. You must use the tools sequentially. "
        "First, clean the data. Then, select the best features. "
        "Finally, train the model and report the probability quality (ROC AUC)."
    ))
    response = llm.invoke([system_prompt] + state["messages"])
    return {"messages": [response]}

**What This Does:**
- Takes the current state (conversation history)
- Adds a system prompt that guides the agent's behavior
- Calls the LLM to decide what to do next
- Returns the LLM's response (which may include tool calls)

---


## Step 4: Construct the LangGraph Workflow


Now we connect everything together into a graph.


In [36]:
# Define Graph Logic
workflow = StateGraph(AgentState)
workflow.add_node("scientist", agent_reasoning)
workflow.add_node("tools", ToolNode(tools))
workflow.set_entry_point("scientist")

# ReAct Loop logic
def router(state):
    if state["messages"][-1].tool_calls:
        return "call_tool"
    return "end"

workflow.add_conditional_edges("scientist", router, {"call_tool": "tools", "end": END})
workflow.add_edge("tools", "scientist")

app = workflow.compile()

**How This Works:**
1. The `scientist` node generates a response
2. The `router` checks if the response contains tool calls
3. If yes → go to the `tools` node
4. If no → the agent is done, go to `END`
5. After tools execute → return to `scientist` to process results

This creates a **loop** where the agent can use multiple tools in sequence!
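
The routing decision can be checked in isolation, without calling the LLM. The sketch below repeats the `router` function and feeds it stand-in objects that only carry a `tool_calls` attribute. These stand-ins are an assumption made for the example; real runs pass LangChain message objects.

```python
from types import SimpleNamespace

def router(state):
    # Same logic as the lab's router: tool calls present -> run tools, else stop.
    if state["messages"][-1].tool_calls:
        return "call_tool"
    return "end"

# Stand-in messages: one that requests a tool, one that does not.
wants_tool = SimpleNamespace(
    tool_calls=[{"name": "clean_speed_dating_data", "args": {}}]
)
finished = SimpleNamespace(tool_calls=[])

print(router({"messages": [wants_tool]}))            # call_tool
print(router({"messages": [wants_tool, finished]}))  # end
```

Only the last message matters to the router, which is why the loop stops as soon as the model replies without requesting a tool.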

---


# Running the Agent


Now let's run our agent with a complex instruction:


In [38]:
query = "Clean 'speeddating.csv', select top 10 features, and predict match probability."

for output in app.stream({"messages": [HumanMessage(content=query)]}, stream_mode="updates"):
    for node_name, state_update in output.items():
        # 1. Capture the Agent's Thought/Reasoning
        if node_name == "scientist":
            message = state_update["messages"][-1]

            # If the agent is calling tools
            if message.tool_calls:
                print("\n🤔 THOUGHT:")
                # Sometimes content is empty when calling tools, so we describe the intent
                print(f"I need to use tools to process the data. Calling: {[t['name'] for t in message.tool_calls]}")

                for tool_call in message.tool_calls:
                    print(f"🛠️  ACTION: {tool_call['name']}({tool_call['args']})")

            # If it's the final response (no more tool calls)
            else:
                print("\n✅ FINAL ANALYSIS:")
                print("-" * 20)
                print(message.content)
                print("-" * 20)

        # 2. Capture the Tool's Output (Observation)
        elif node_name == "tools":
            # The tool node returns a list of ToolMessages
            for tool_msg in state_update["messages"]:
                print("\n👁️  OBSERVATION:")
                # We limit long outputs like CSV summaries for readability
                print(f"{str(tool_msg.content)[:500]}...")

print("\n" + "="*50)
print("🏁 Workflow Complete.")


---


# Understanding the Output


When you run the agent, you'll see a conversation unfold:

**Expected Output Flow:**

🤔 THOUGHT:
I need to use tools to process the data. Calling: ['clean_speed_dating_data']
🛠️  ACTION: clean_speed_dating_data({'file_name': 'speeddating.csv'})

👁️  OBSERVATION:
Data cleaned. Rows: 8378. Saved to 'cleaned_data.csv'....

🤔 THOUGHT:
I need to use tools to process the data. Calling: ['select_top_features']
🛠️  ACTION: select_top_features({'n_features': 10})

👁️  OBSERVATION:
{"selected_features": ["field", "pref_o_attractive", "attractive_o", "funny_o", "shared_interests_o", "attractive_important", "attractive_partner", "interests_correlate", "like", "guess_prob_liked"]}...

🤔 THOUGHT:
I need to use tools to process the data. Calling: ['train_probability_model']
🛠️  ACTION: train_probability_model({'features': ['field', 'pref_o_attractive', 'attractive_o', 'funny_o', 'shared_interests_o', 'attractive_important', 'attractive_partner', 'interests_correlate', 'like', 'guess_prob_liked']})

👁️  OBSERVATION:
Model trained on ['field', 'pref_o_attractive', 'attractive_o', 'funny_o', 'shared_interests_o', 'attractive_important', 'attractive_partner', 'interests_correlate', 'like', 'guess_prob_liked']. ROC AUC Score: 0.8341. The probability predictions are reliable....

✅ FINAL ANALYSIS:

The data from 'speeddating.csv' has been cleaned, and the top 10 features selected for predicting match probability are:

1. field
2. pref_o_attractive
3. attractive_o
4. funny_o
5. shared_interests_o
6. attractive_important
7. attractive_partner
8. interests_correlate
9. like
10. guess_prob_liked

The model has been trained using these features, and the ROC AUC score is **0.8341**, indicating that the probability predictions are reliable.


🏁 Workflow Complete.


**What Just Happened?**
1. The agent decided it needed to clean data first
2. It called the cleaning tool and received confirmation
3. It then decided to select features
4. It received the top 10 features
5. It trained a model using those features
6. It received the ROC AUC score
7. Finally, it synthesized all results into a coherent summary

All of this happened **autonomously** from a single instruction!


---


# Exercise 1: Modify the Agent to Use Different Feature Counts


Update the system prompt to allow the agent to experiment with different numbers of features (5, 10, 15) and compare the results.


In [None]:
def experimental_agent_reasoning(state: AgentState):
    system_prompt = SystemMessage(content=(
        "You are a Match Prediction Expert conducting feature selection experiments. "
        "First, clean the data. Then, for each of [5, 10, 15] features: "
        "1. Select that number of features "
        "2. Train a model "
        "3. Record the ROC AUC score "
        "Finally, report which feature count gave the best performance."
    ))
    response = llm.invoke([system_prompt] + state["messages"])
    return {"messages": [response]}

<details>
    <summary>Click here for Solution</summary>

```python
def experimental_agent_reasoning(state: AgentState):
    system_prompt = SystemMessage(content=(
        "You are a Match Prediction Expert conducting feature selection experiments. "
        "First, clean the data. Then, for each of [5, 10, 15] features: "
        "1. Select that number of features "
        "2. Train a model "
        "3. Record the ROC AUC score "
        "Finally, report which feature count gave the best performance."
    ))
    response = llm.invoke([system_prompt] + state["messages"])
    return {"messages": [response]}
```

</details>


---


# Conclusion

Congratulations! You've just built a **Match Prediction Agent** powered by LangGraph!

Here's what you accomplished:

- **Created three specialized tools** for data cleaning, feature selection, and model training
- **Built a ReAct agent** that autonomously decides which tools to use and when
- **Designed a LangGraph workflow** that manages the entire ML pipeline with a single instruction
- **Learned about state management** and how agents maintain context across multiple tool calls
- **Evaluated match predictions** using ROC AUC scores for probability-based predictions
- **Handled real-world data challenges** like byte strings, missing values, and data leakage

This project demonstrated how agentic workflows can transform complex, multistep data science tasks into simple, natural language instructions. The same pattern can be extended to:
- Hyperparameter tuning experiments
- Multi-model comparisons
- Feature engineering pipelines
- Automated reporting and visualization

You now have the foundation to build your own AI agents for any sequential, decision-based workflow!
