# Talk to Data (CSV Agent)

**Project Goal:**
Allow a user to upload a CSV file, ask natural-language questions such as "Show me sales trends", and automatically generate graphs from the data using Gemini 2.5 Flash and LangChain.

**Overview:**
1.  **Upload CSV:** Load your data into the notebook.
2.  **Initialize Agent:** Set up the LangChain Pandas DataFrame Agent with Gemini 2.5 Flash.
3.  **Ask Questions:** Query your data in natural language.
4.  **Visualize:** See the results and graphs automatically.

In [None]:
# Install required libraries
# pandas: for data manipulation
# langchain, langchain-experimental: for the agent framework
# google-generativeai: for interacting with Gemini API
# matplotlib, seaborn: for data visualization
# We handle potential dependency conflicts by cleaning up and upgrading key libraries
!pip uninstall -y langchain-classic
!pip install -q -U pandas google-generativeai matplotlib seaborn tabulate "requests==2.32.4" langchain langchain-community langchain-experimental langchain-core langchain-google-genai


In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from google.colab import userdata
import google.generativeai as genai
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_google_genai import ChatGoogleGenerativeAI

# Set visualization style for better aesthetics
sns.set_theme(style="whitegrid")

In [None]:
# API Key Setup
# Please ensure you have added your 'GOOGLE_API_KEY' in the Colab 'Secrets' (key icon on the left)
try:
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
    genai.configure(api_key=GOOGLE_API_KEY)
    os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
    print("API Key successfully loaded.")
except Exception as e:
    # Include the underlying error so a missing secret and a denied permission look different
    print(f"Error loading API key ({e}). Please make sure 'GOOGLE_API_KEY' is set in Colab Secrets.")
    # Fallback input if secret isn't found (optional, but good for UX)
    # import getpass
    # os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google API Key: ")

API Key successfully loaded.
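
The commented-out fallback in the cell above can be folded into a small helper. The sketch below is one way to do that outside Colab: it checks an environment mapping first and prompts only when nothing is found. `resolve_api_key` and its `prompt` parameter are hypothetical names, not part of the Colab or Gemini APIs.

```python
import getpass
import os


def resolve_api_key(env=None, prompt=getpass.getpass, name="GOOGLE_API_KEY"):
    """Return an API key from `env`, falling back to an interactive prompt.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        # Nothing configured: ask the user without echoing the key
        key = prompt(f"Enter your {name}: ").strip()
    if not key:
        raise ValueError(f"{name} is empty; cannot configure the client")
    return key
```

In the notebook this could be used as `os.environ["GOOGLE_API_KEY"] = resolve_api_key()`.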


In [None]:
# CSV Upload
from google.colab import files

print("Please upload your CSV file:")
uploaded = files.upload()

# Get the filename of the uploaded file
filename = next(iter(uploaded))

# Load the CSV into a Pandas DataFrame
try:
    df = pd.read_csv(filename)
    print(f"\nSuccessfully loaded '{filename}'")
    print("Data Preview:")
    display(df.head())
except Exception as e:
    print(f"Error loading CSV: {e}")

Please upload your CSV file:


Saving Mall_Customers.csv to Mall_Customers.csv

Successfully loaded 'Mall_Customers.csv'
Data Preview:


Unnamed: 0,CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40
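
Before handing the DataFrame to the agent, it can help to check which columns pandas actually parsed as numeric, since those are the ones a query about "numerical columns" will act on. The sketch below inlines the five preview rows shown above as a CSV string (an assumption made so the example is self-contained; the notebook itself would use the uploaded `df`) and lists the numeric columns.

```python
import io

import pandas as pd

# The five preview rows from above, inlined so the example is self-contained
preview_csv = """CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77
5,Female,31,17,40
"""

preview_df = pd.read_csv(io.StringIO(preview_csv))

# select_dtypes keeps only the columns whose dtype is numeric
numeric_columns = preview_df.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

Note that `CustomerID` counts as numeric here even though it is an identifier, so it may be worth dropping before plotting distributions.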


In [None]:
# Agent Initialization

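# Initialize the Gemini model via LangChain
# "gemini-flash-lite-latest" is a rolling alias; to pin the model named in the
# overview, use "gemini-2.5-flash" (both appear in the model list printed below)
# We set temperature to 0 for more deterministic code generation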
llm = ChatGoogleGenerativeAI(
    model="gemini-flash-lite-latest",
    temperature=0,
    max_output_tokens=2048
)

# Create the Pandas DataFrame Agent
# verbose=True helps us see the agent's thought process
# allow_dangerous_code=True is required for the experimental agent to execute Python code
agent = create_pandas_dataframe_agent(
    llm,
    df,
    verbose=True,
    allow_dangerous_code=True
)

print("Agent initialized successfully! You can now ask questions.")

Agent initialized successfully! You can now ask questions.
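
As the traces in this notebook show, the agent reasons from the `df.head()` preview, so a question that uses a word missing from the column names (for example "sales") forces it to guess. One possible mitigation, sketched here under assumptions, is to prepend a short description of the columns to each query. `build_schema_hint`, `with_schema_hint`, and the example descriptions are hypothetical and not part of LangChain.

```python
def build_schema_hint(descriptions):
    """Format {column: meaning} pairs as a short preamble for a query.

    Hypothetical helper; the descriptions are supplied by the user.
    """
    lines = ["Column meanings:"]
    for column, meaning in descriptions.items():
        lines.append(f"- {column}: {meaning}")
    return "\n".join(lines)


def with_schema_hint(query, descriptions):
    """Prepend the schema hint to a natural-language query."""
    return f"{build_schema_hint(descriptions)}\n\nQuestion: {query}"


# Example descriptions (assumed meanings, for illustration only)
hinted = with_schema_hint(
    "average spending score by gender",
    {"Genre": "customer gender", "Spending Score (1-100)": "score from 1 to 100"},
)
print(hinted)
```

The resulting string can be passed to the agent in place of the raw question.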


In [None]:
# Natural Language Query Execution

def ask_agent(query):
    """
    Sends a query to the agent and displays the result.
    Handles plot display if the agent generates a visualization.
    """
    print(f"Processing Query: '{query}'...\n")

    try:
        # Run the agent; invoke() replaces the deprecated run() and returns a dict
        response = agent.invoke({"input": query})["output"]
        print("\nAnswer:")
        print(response)

        # The agent's generated code draws on the global pyplot state,
        # so show any figures that are still open after the run
        if plt.get_fignums():
            plt.show()

    except Exception as e:
        print(f"An error occurred: {e}")

# Run a single example query (change the string to ask a different question)
query = "Show me a distribution of the numerical columns"
ask_agent(query)
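
Agent calls can fail, and the return shape depends on how the agent is called: the older `run()` returns a string, while `invoke()` returns a dict that normally carries the answer under `"output"`. The sketch below normalizes both and accepts any callable, so it can be tried with a stub instead of a live model. `extract_answer`, `safe_query`, and `fake_agent` are hypothetical names.

```python
def extract_answer(result):
    """Return the answer text from a plain string or an invoke()-style dict."""
    if isinstance(result, dict):
        return str(result.get("output", ""))
    return str(result)


def safe_query(call_agent, query):
    """Call the agent and return (answer, error); exactly one of them is None."""
    try:
        return extract_answer(call_agent(query)), None
    except Exception as exc:  # broad on purpose: generated code can raise anything
        return None, f"{type(exc).__name__}: {exc}"


def fake_agent(query):
    """Stub standing in for the real agent (assumption for this example)."""
    if not query.strip():
        raise ValueError("empty query")
    return {"input": query, "output": f"echo: {query}"}


print(safe_query(fake_agent, "How many rows?"))
print(safe_query(fake_agent, "   "))
```

With the real agent, `call_agent` could be something like `lambda q: agent.invoke({"input": q})`.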

Processing Query: 'Show me a distribution of the numerical columns'...



> Entering new AgentExecutor chain...
Thought: The user wants to see the distribution of the numerical columns in the dataframe `df`. The best way to do this in pandas is to use the `.describe()` method on the dataframe, which provides summary statistics for numerical columns.

First, I should check the columns to confirm which ones are numerical, although based on `df.head()`, 'Age', 'Annual Income (k$)', and 'Spending Score (1-100)' appear to be numerical.

Action: python_repl_ast
Action Input: print(df.describe())
       CustomerID         Age  Annual Income (k$)  Spending Score (1-100)
count  200.000000  200.000000          200.000000              200.000000
mean   100.500000   38.850000           60.560000               50.200000
std     57.879185   13.969007           26.264721               25.823522
min      1.000000   18.000000           15.000000                1.000

In [None]:
# List the models available to this API key that support text generation
for m in genai.list_models():
    if "generateContent" in m.supported_generation_methods:
        print(m.name)

models/gemini-2.5-flash
models/gemini-2.5-pro
models/gemini-2.0-flash-exp
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-2.0-flash-lite-preview-02-05
models/gemini-2.0-flash-lite-preview
models/gemini-exp-1206
models/gemini-2.5-flash-preview-tts
models/gemini-2.5-pro-preview-tts
models/gemma-3-1b-it
models/gemma-3-4b-it
models/gemma-3-12b-it
models/gemma-3-27b-it
models/gemma-3n-e4b-it
models/gemma-3n-e2b-it
models/gemini-flash-latest
models/gemini-flash-lite-latest
models/gemini-pro-latest
models/gemini-2.5-flash-lite
models/gemini-2.5-flash-image-preview
models/gemini-2.5-flash-image
models/gemini-2.5-flash-preview-09-2025
models/gemini-2.5-flash-lite-preview-09-2025
models/gemini-3-pro-preview
models/gemini-3-flash-preview
models/gemini-3-pro-image-preview
models/nano-banana-pro-preview
models/gemini-robotics-er-1.5-preview
models/gemini-2.5-computer-use-prev
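
The list is long, and many entries are previews or special-purpose variants. A small filter like the following sketch can narrow it to general-purpose candidates. The exclusion keywords are an assumption based on the names printed above, not an official classification, and `pick_models` is a hypothetical helper.

```python
def pick_models(names, family="flash", exclude=("preview", "exp", "tts", "image")):
    """Keep model names that contain `family` and none of the `exclude` keywords."""
    picked = []
    for name in names:
        short = name.split("/", 1)[-1]  # drop the "models/" prefix
        parts = short.split("-")
        if family in parts and not any(word in parts for word in exclude):
            picked.append(short)
    return picked


# A few names copied from the output above
sample = [
    "models/gemini-2.5-flash",
    "models/gemini-2.5-pro",
    "models/gemini-2.0-flash-exp",
    "models/gemini-2.5-flash-preview-tts",
    "models/gemini-2.5-flash-image",
    "models/gemini-flash-lite-latest",
]
print(pick_models(sample))
```

Matching on whole hyphen-separated parts avoids accidental substring hits, at the cost of missing names that use a different separator.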

In [None]:
# Interactive Mode (Optional)
# Run this cell to keep asking questions until you type 'exit'

while True:
    user_input = input("\nAsk a question (or type 'exit' to quit): ")
    if user_input.lower() in ['exit', 'quit']:
        break
    ask_agent(user_input)
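
The loop above stops only on 'exit' or 'quit'. A slightly more defensive variant, sketched below, also skips blank input and ends cleanly on Ctrl+C or end of input. The `ask` and `read` parameters are hypothetical hooks added so the loop can be exercised without a live agent.

```python
def run_loop(ask, read=input):
    """Read questions until the user exits; return how many were asked."""
    asked = 0
    while True:
        try:
            text = read("\nAsk a question (or type 'exit' to quit): ").strip()
        except (EOFError, KeyboardInterrupt):
            break  # end of input or Ctrl+C ends the session quietly
        if text.lower() in {"exit", "quit"}:
            break
        if not text:
            continue  # ignore blank lines instead of querying the agent
        ask(text)
        asked += 1
    return asked


# Scripted demo: feed canned inputs instead of typing them
scripted = iter(["", "average age", "  QUIT "])
seen = []
count = run_loop(seen.append, read=lambda prompt: next(scripted))
print(count, seen)
```

In the notebook, the equivalent call would be `run_loop(ask_agent)`.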


Ask a question (or type 'exit' to quit): total sales of males
Processing Query: 'total sales of males'...



> Entering new AgentExecutor chain...
Thought: The user is asking for the "total sales of males". Looking at the provided `df.head()`, there is no column explicitly named "sales". The columns available are 'CustomerID', 'Genre', 'Age', 'Annual Income (k$)', and 'Spending Score (1-100)'. I need to determine which column represents "sales". Given the context of customer data, 'Spending Score (1-100)' or 'Annual Income (k$)' might be related to sales/spending. Since "total sales" usually implies a sum of a transaction/spending column, I will assume 'Spending Score (1-100)' is the relevant column for "sales" for now, or perhaps 'Annual Income (k$)' if the context implies income is the measure of sales. Without further clarification, 'Spending Score (1-100)' seems more like a measure of transactional activity (sales/spending). I will calculate the sum of 'Spendi