# Simple Data Analysis Agent
## Overview
In this notebook, I create an AI-powered data analysis agent that interprets and answers natural-language questions about a dataset. It combines a language model with data manipulation tools to enable intuitive data exploration.

## Key Components
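1. Language Model: Gemini 1.5 Flash, accessed through `langchain_google_genai`
2. Data Manipulation Framework: pandas
3. Agent Framework: LangChain's `create_pandas_dataframe_agent`
4. Synthetic Dataset: 1,000 rows of generated car sales records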


## Method
1. Create a synthetic dataset representing car sales data.
2. Construct an agent that combines the language model with data analysis capabilities.
3. Implement a query processing function to handle natural language questions.
4. Demonstrate the agent's abilities with example queries.

Import libraries and set environment variables.

In [43]:
import os
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain.agents import AgentType
from langchain_google_genai import ChatGoogleGenerativeAI
import pandas as pd
import numpy as np
from datetime import datetime, timedelta


from dotenv import load_dotenv
load_dotenv()

np.random.seed(42)

In [44]:
# Generate sample data
n_rows = 1000

# Generate dates
start_date = datetime(2022, 1, 1)
dates = [start_date + timedelta(days=i) for i in range(n_rows)]

# Define data categories
makes = ['Toyota', 'Honda', 'Ford', 'Chevrolet', 'Nissan', 'BMW', 'Mercedes', 'Audi', 'Hyundai', 'Kia']
models = ['Sedan', 'SUV', 'Truck', 'Hatchback', 'Coupe', 'Van']
colors = ['Red', 'Blue', 'Black', 'White', 'Silver', 'Gray', 'Green']

# Create the dataset
data = {
    'Date': dates,
    'Make': np.random.choice(makes, n_rows),
    'Model': np.random.choice(models, n_rows),
    'Color': np.random.choice(colors, n_rows),
    'Year': np.random.randint(2015, 2023, n_rows),
    'Price': np.random.uniform(20000, 80000, n_rows).round(2),
    'Mileage': np.random.uniform(0, 100000, n_rows).round(0),
    'EngineSize': np.random.choice([1.6, 2.0, 2.5, 3.0, 3.5, 4.0], n_rows),
    'FuelEfficiency': np.random.uniform(20, 40, n_rows).round(1),
    'SalesPerson': np.random.choice(['Alice', 'Bob', 'Charlie', 'David', 'Eva'], n_rows)
}

# Create DataFrame and sort by date
df = pd.DataFrame(data).sort_values('Date')

# Display sample data and statistics
print("\nFirst few rows of the generated data:")
print(df.head())

print("\nDataFrame info:")
df.info()

print("\nSummary statistics:")
print(df.describe())


First few rows of the generated data:
        Date       Make      Model  Color  Year     Price  Mileage  \
0 2022-01-01   Mercedes      Sedan  Green  2022  57952.65   5522.0   
1 2022-01-02  Chevrolet  Hatchback    Red  2021  58668.22  94238.0   
2 2022-01-03       Audi      Truck  White  2019  69187.87   7482.0   
3 2022-01-04     Nissan  Hatchback  Black  2016  40004.44  43846.0   
4 2022-01-05   Mercedes  Hatchback    Red  2016  63983.07  52988.0   

   EngineSize  FuelEfficiency SalesPerson  
0         2.0            24.7       Alice  
1         1.6            26.2         Bob  
2         2.0            28.0       David  
3         3.5            24.8       David  
4         2.5            24.1       Alice  

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Date            1000 non-null   datetime64[n

In [48]:
df.head()

Unnamed: 0,Date,Make,Model,Color,Year,Price,Mileage,EngineSize,FuelEfficiency,SalesPerson
0,2022-01-01,Mercedes,Sedan,Green,2022,57952.65,5522.0,2.0,24.7,Alice
1,2022-01-02,Chevrolet,Hatchback,Red,2021,58668.22,94238.0,1.6,26.2,Bob
2,2022-01-03,Audi,Truck,White,2019,69187.87,7482.0,2.0,28.0,David
3,2022-01-04,Nissan,Hatchback,Black,2016,40004.44,43846.0,3.5,24.8,David
4,2022-01-05,Mercedes,Hatchback,Red,2016,63983.07,52988.0,2.5,24.1,Alice
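
Before handing questions to the agent, it can help to compute a few reference answers directly with pandas, so the agent's replies can be checked against known values. The following is a minimal sketch, not part of the original notebook: the helper name `reference_answers` and the three-row `demo_df` are hypothetical, and the sketch only assumes a DataFrame with a numeric `Price` column like the one above.

```python
import pandas as pd


def reference_answers(frame: pd.DataFrame) -> dict:
    """Compute ground-truth values for a few simple questions about the data."""
    return {
        "columns": list(frame.columns),
        "n_rows": len(frame),
        "mean_price": round(float(frame["Price"].mean()), 2),
    }


# Hypothetical stand-in data; in the notebook you would pass `df` instead.
demo_df = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Ford"],
    "Price": [20000.0, 30000.0, 40000.0],
})

answers = reference_answers(demo_df)
print(answers)
```

Calling `reference_answers(df)` on the notebook's DataFrame would give values to compare with the agent's answers about column names, row count, and average price.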


In [45]:
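# Read the API key loaded from .env earlier and fail early if it is missing
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise ValueError("GOOGLE_API_KEY is not set; add it to your .env file.")

llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",
    temperature=0,  # keep sampling randomness to a minimum
    google_api_key=api_key
)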

In [46]:
agent_executor = create_pandas_dataframe_agent(
    llm,
    df,
    verbose=True,
    allow_dangerous_code=True,
    return_intermediate_steps=True
)
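
Because the agent is created with `return_intermediate_steps=True`, `invoke` returns a dictionary with `input`, `output`, and `intermediate_steps` keys rather than a bare string. The sketch below shows one possible way to format such a response. It is an illustration under assumptions: `summarize_response`, `FakeAction`, and `demo_response` are hypothetical names, and `FakeAction` is only a stand-in that mimics the `tool` and `tool_input` attributes of the agent's action objects so the example runs without an API key.

```python
from collections import namedtuple

# Hypothetical stand-in for the agent's action objects, reduced to the two
# attributes this sketch reads.
FakeAction = namedtuple("FakeAction", ["tool", "tool_input"])


def summarize_response(response: dict) -> str:
    """Format the final answer followed by the tool calls that led to it."""
    lines = [f"Answer: {response['output']}"]
    steps = response.get("intermediate_steps", [])
    for i, (action, observation) in enumerate(steps, start=1):
        lines.append(f"Step {i}: {action.tool}({action.tool_input!r})")
        lines.append(f"  -> {str(observation).strip()}")
    return "\n".join(lines)


# Hypothetical response shaped like the dictionary the agent returns.
demo_response = {
    "input": "How many rows are in this dataset?",
    "output": "1000",
    "intermediate_steps": [(FakeAction("python_repl_ast", "len(df)"), 1000)],
}

summary = summarize_response(demo_response)
print(summary)
```

Separating the final answer from the intermediate steps keeps the printed result short while still showing which code the agent ran.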

In [49]:
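# Helper that sends a question to the agent and prints the full response
def ask_agent(question):
    """Invoke the agent and print the question with the raw response dict."""
    response = agent_executor.invoke(question)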
    print(f"Question: {question}")
    print(f"Answer: {response}")
    print("---")

# Example questions
ask_agent("What are the column names in this dataset?")
ask_agent("How many rows are in this dataset?")
ask_agent("What is the average price of cars sold?")



> Entering new AgentExecutor chain...
Thought: I need to access the column names of the pandas DataFrame `df`.  I can do this using the `columns` attribute.
Action: python_repl_ast
Action Input: `print(df.columns)`
Index(['Date', 'Make', 'Model', 'Color', 'Year', 'Price', 'Mileage',
       'EngineSize', 'FuelEfficiency', 'SalesPerson'],
      dtype='object')
I now know the final answer.
Final Answer: ['Date', 'Make', 'Model', 'Color', 'Year', 'Price', 'Mileage', 'EngineSize', 'FuelEfficiency', 'SalesPerson']

> Finished chain.
Question: What are the column names in this dataset?
Answer: {'input': 'What are the column names in this dataset?', 'output': "['Date', 'Make', 'Model', 'Color', 'Year', 'Price', 'Mileage', 'EngineSize', 'FuelEfficiency', 'SalesPerson']", 'intermediate_steps': [(AgentAction(tool='python_repl_ast', tool_input='`print(df.columns)`', log='Thought: I need to access the column names of the pandas Da