# Using an LLM Agent with Web Search and DeepSeek to Classify Sports Teams from a CSV File

This notebook demonstrates how to use a LangChain-based LLM agent with a web search tool to classify sports teams listed in a CSV file. Each row in the CSV contains a team name. The agent will use real-time web search to determine the sport associated with each team and add a corresponding label.

## Step 1: Install Required Packages

We will use the LangChain framework, the `langchain_openai` client (pointed at DeepSeek's OpenAI-compatible API), and the Tavily web search integration. Pandas is used to work with CSV files.

In [24]:
!pip install --quiet langchain_openai pandas tavily-python langchain-community

## Step 2: Set Up API Keys

You will need:
- A DeepSeek API Key
- A Tavily API key (sign up at https://app.tavily.com/)

These are required to access the LLM and the web search service.

In [25]:
import os
import getpass

# ChatOpenAI reads its key from OPENAI_API_KEY, so the DeepSeek key is stored under that name.
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your DeepSeek API key: ")
os.environ["TAVILY_API_KEY"] = getpass.getpass("Enter your Tavily API key: ")

Enter your DeepSeek API key: ··········
Enter your Tavily API key: ··········
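
Before creating the model, it can help to confirm that both keys were actually set. The following is a minimal sketch; `missing_keys` is a hypothetical helper and not part of LangChain or Tavily.

```python
import os

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    # Resolve the environment at call time so later changes are seen.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]

# Check the two variables set in the cell above.
missing = missing_keys(["OPENAI_API_KEY", "TAVILY_API_KEY"])
if missing:
    print("Missing API keys:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.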


## Step 3: Initialize the Agent with Web Search Tool

We will set up an LLM agent with the Tavily web search tool. The agent will decide when to invoke the search tool to classify each team.

In [26]:
from langchain_openai import ChatOpenAI
from langchain.agents import initialize_agent, AgentType
from langchain_community.tools.tavily_search import TavilySearchResults

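# Initialize the model and tools.
# DeepSeek exposes an OpenAI-compatible API, so ChatOpenAI is pointed at its base URL.
llm = ChatOpenAI(temperature=0,
                 base_url="https://api.deepseek.com",
                 model="deepseek-chat")
# max_results limits how many search hits Tavily returns per query.
search_tool = TavilySearchResults(max_results=3)
agent = initialize_agent([search_tool], llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True,
                         handle_parsing_errors=True)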
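
To make the "decide when to invoke the search tool" step more concrete: in a ReAct-style agent the model emits `Action:` and `Action Input:` lines as plain text, and the executor parses them and looks the tool up by name. The sketch below is a simplified illustration of that idea, not LangChain's actual parser; `parse_action`, `run_step`, and the stub tool are hypothetical names.

```python
import re

# Matches the "Action:" / "Action Input:" pair that a ReAct prompt asks for.
ACTION_RE = re.compile(
    r"Action\s*:\s*(.*?)\s*\n\s*Action Input\s*:\s*(.*)", re.DOTALL
)

def parse_action(llm_text):
    """Return (tool_name, tool_input), or None if no action is present."""
    match = ACTION_RE.search(llm_text)
    if match is None:
        return None
    tool_name = match.group(1).strip()
    tool_input = match.group(2).strip().strip('"')
    return tool_name, tool_input

def run_step(llm_text, tools):
    """Dispatch one parsed action to a tool and return the observation text."""
    parsed = parse_action(llm_text)
    if parsed is None:
        return "Could not parse an action."
    tool_name, tool_input = parsed
    if tool_name not in tools:
        # The name must match exactly; extra brackets make the lookup fail.
        return f"{tool_name} is not a valid tool, try one of {list(tools)}."
    return tools[tool_name](tool_input)

# Stub standing in for the real search tool.
tools = {"tavily_search_results_json": lambda q: f"results for: {q}"}
```

In this simplified version, a tool name wrapped in square brackets is rejected because the lookup is an exact string match, and the resulting message is what the model sees as its next observation.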

In [27]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Step 4: Load the CSV File

We load the CSV file that contains the names of sports teams. Each row has one team name under a column named 'Team'.

In [37]:
import pandas as pd
from IPython.display import display

# Path to your CSV file
data_path = "../../data/sports_teams.csv"
if os.path.exists(data_path):
    df = pd.read_csv(data_path)
    display(df.head())
else:
    try:
        csv_path = "sports_teams.csv"
        df = pd.read_csv(csv_path)
        display(df.head())
    except Exception as e:
        print(f"{e}\nThe CSV file should be in the same directory as the notebook, or you must be running the code locally using git clone.")

Unnamed: 0,Team,Country,City,Stadium Capacity
0,Los Angeles Lakers,USA,Los Angeles,19000
1,Manchester United,UK,Manchester,74000
2,Toronto Maple Leafs,Canada,Toronto,18600
3,Golden State Warriors,USA,San Francisco,19500
4,New England Patriots,USA,Boston,65878


## Step 5: Classify Each Team Using the Agent

The agent will search the web and respond with the sport that each team plays. As a quick test, we apply this to the first three teams; remove the row limit to process the full dataset.

In [38]:
# Define a function that sends a query to an LLM agent to classify the sport associated with a given team name.
# The agent uses web search and reasoning to identify the sport, and returns only the name of the most well-known one.
# If an error occurs during the query process, it returns an error message instead.
def classify_team_with_agent(team_name):
    try:
        query = f"""What sport does the team '{team_name}' play? only return
        the sport name, if there are multiple sports associated with the name,
        only return the most famous one"""
        return agent.invoke(query)["output"]
    except Exception as e:
        return f"error: {str(e)}"

# Apply the classification function to the first 3 rows of the 'Team' column as a quick test
# before running the full dataset. The results go into a new 'Sport' column; rows that were
# not classified are left empty (NaN).
df["Sport"] = df["Team"].head(3).apply(classify_team_with_agent)

df.head()



> Entering new AgentExecutor chain...
Thought: The Los Angeles Lakers are a well-known professional sports team, and I need to determine which sport they are most famous for playing. Since they are based in Los Angeles, they are likely part of a major U.S. sports league.

Action: [tavily_search_results_json]
Action Input: "What sport do the Los Angeles Lakers play?"

Observation: [tavily_search_results_json] is not a valid tool, try one of [tavily_search_results_json].
Thought: I need to correct the action format and try again.

Action: tavily_search_results_json
Action Input: "What sport do the Los Angeles Lakers play?"

Observation: [{'title': 'Los Angeles Lakers News, Videos, Schedule, Roster, Stats', 'url': 'https://sports.yahoo.com/nba/teams/la-lakers/', 'content': '*   [MMA](https://sports.yahoo.com/mma/)\n    *   [WNBA](https://sports.yahoo.com/wnba/)\n    *   [Sportsbook](https://sports.yahoo.com/sportsbook/)\n    * 

Unnamed: 0,Team,Country,City,Stadium Capacity,Sport
0,Los Angeles Lakers,USA,Los Angeles,19000,basketball
1,Manchester United,UK,Manchester,74000,football
2,Toronto Maple Leafs,Canada,Toronto,18600,ice hockey
3,Golden State Warriors,USA,San Francisco,19500,
4,New England Patriots,USA,Boston,65878,


## Step 6: Save the Updated Table

The resulting DataFrame will be saved to a new CSV file with the added sport classification column.

In [39]:
output_path = "classified_teams_with_web_search.csv"
df.to_csv(output_path, index=False)
print(f"Saved results to {output_path}")

Saved results to classified_teams_with_web_search.csv


## Classifying Teams Without an Agent (Direct DeepSeek LLM Call)

In this section, we repeat the classification task without an agent. Instead, we call the DeepSeek model directly for each team with a single prompt. This avoids tool use and gives a simpler, single-call classification that relies only on what the model already knows.


In [40]:
# Function that sends a prompt to the DeepSeek LLM directly (no tools, no agent loop)
def classify_team_direct(team_name):
    prompt = f"""
You are a sports expert. Given the team name: "{team_name}", respond with the **sport it is most famously associated with**.
Only output the sport name, and do not include any explanation.
If it's ambiguous, pick the most common interpretation based on global popularity.
"""
    try:
        # invoke() returns a message object; its text is in .content
        return llm.invoke(prompt).content.strip()
    except Exception as e:
        return f"error: {str(e)}"
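
Each call to the model costs time and tokens, so a repeated team name is worth classifying only once. A minimal sketch using `functools.lru_cache` follows; `fake_classifier` is a stand-in for `classify_team_direct` so that the example runs without network access, and `classify_cached` is a hypothetical name.

```python
from functools import lru_cache

calls = []  # records which names reached the underlying classifier

def fake_classifier(team_name):
    """Stand-in for classify_team_direct; returns a fixed label."""
    calls.append(team_name)
    return "basketball"

@lru_cache(maxsize=None)
def classify_cached(team_name):
    """Classify a team once and reuse the result for repeated names."""
    return fake_classifier(team_name)

labels = [classify_cached(name) for name in ["Lakers", "Lakers", "Warriors"]]
```

Note that this approach would also cache an `error: ...` string returned after a failed call, so a real version might skip caching failures.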

In [41]:
# Apply direct classification to the first 3 rows (this overwrites the agent's 'Sport' column)
df["Sport"] = df["Team"].head(3).apply(classify_team_direct)
df[["Team", "Sport"]].head()

Unnamed: 0,Team,Sport
0,Los Angeles Lakers,Basketball
1,Manchester United,football
2,Toronto Maple Leafs,Ice hockey
3,Golden State Warriors,
4,New England Patriots,


In [42]:
# Save the results with the new column
df.to_csv("classified_teams_direct_deepseek.csv", index=False)
print("Results saved to 'classified_teams_direct_deepseek.csv'")


Results saved to 'classified_teams_direct_deepseek.csv'
