# Data Analysis Simple Agent
### Overview
This notebook guides you through creating an AI-powered data analysis agent that can interpret and answer questions about a dataset using natural language. It combines language models with data manipulation tools to enable intuitive data exploration.

### Motivation
Data analysis often requires specialized knowledge, limiting access to insights for non-technical users. By creating an AI agent that understands natural language queries, we can democratize data analysis, allowing anyone to extract valuable information from complex datasets without needing to know programming or statistical tools.

### Key Components
- Language Model: Processes natural language queries and generates human-like responses
- Data Manipulation Framework: Handles dataset operations and analysis
- Agent Framework: Connects the language model with data manipulation tools
- Synthetic Dataset: Represents real-world data for demonstration purposes

### Method
1. Create a synthetic dataset representing car sales data
2. Construct an agent that combines the language model with data analysis capabilities
3. Implement a query processing function to handle natural language questions
4. Demonstrate the agent's abilities with example queries

### Conclusion
This approach to data analysis offers significant benefits:

- Accessibility for non-technical users
- Flexibility in handling various query types
- Efficient ad-hoc data exploration

By making data insights more accessible, this method can change how organizations use their data for decision-making.

### Import required libraries

In [None]:
!pip -q install langchain huggingface_hub openai tiktoken python-dotenv
!pip -q install sentence_transformers langchain_community langchain_experimental

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/2.5 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━[0m [32m2.2/2.5 MB[0m [31m66.4 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m43.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.2/45.2 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m67.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m80.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m43.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
!pip install -q langchain_groq

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/131.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━[0m [32m122.9/131.1 kB[0m [31m4.2 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m131.1/131.1 kB[0m [31m3.4 MB/s[0m eta [36m0:00:00[0m
[?25h

In [1]:
import os
from langchain_groq import ChatGroq
from langchain.agents import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from google.colab import userdata

os.environ["GROQ_API_KEY"] = userdata.get("GROQ_API_KEY")
# os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

# set a random seed for reproducibility
np.random.seed(42)
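
The cell above relies on `google.colab.userdata`, which is only available inside Colab. As a minimal sketch for other environments, the helper below reads the key from an environment variable and fails with a clear message when it is missing. The name `load_api_key` is hypothetical and not part of any library.

```python
import os


def load_api_key(name: str = "GROQ_API_KEY") -> str:
    """Return the API key stored in the environment variable `name`.

    Hypothetical helper for running the notebook outside Colab.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before starting the notebook."
        )
    return key
```

Calling `load_api_key()` before constructing the model surfaces a missing key early, rather than as an authentication error on the first query.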

### Generate Sample Data
In this section, we create a synthetic dataset of car sales.

In [2]:
# generate sample data
n_rows = 1000

# Generate dates
start_date = datetime(2024, 1, 1)
dates = [start_date + timedelta(days=i) for i in range(n_rows)]

# Define data categories
makes = ['Toyota', 'Honda', 'Ford', 'Chevrolet', 'Nissan', 'BMW', 'Mercedes', 'Audi', 'Hyundai', 'Kia']
models = ['Sedan', 'SUV', 'Truck', 'Hatchback', 'Coupe', 'Van']
colors = ['Red', 'Blue', 'Black', 'White', 'Silver', 'Gray', 'Green']

# Create the dataset
data = {
    'Date': dates,
    'Make': np.random.choice(makes, n_rows),
    'Model': np.random.choice(models, n_rows),
    'Color': np.random.choice(colors, n_rows),
    'Year': np.random.randint(2019, 2024, n_rows),
    'Price': np.random.uniform(20000, 80000, n_rows).round(2),
    'Mileage': np.random.uniform(0, 100000, n_rows).round(0),
    'EngineSize': np.random.choice([1.6, 2.0, 2.5, 3.0, 3.5, 4.0], n_rows),
    'FuelEfficiency': np.random.uniform(20, 40, n_rows).round(1),
    'SalesPerson': np.random.choice(['Alice', 'Bob', 'Charlie', 'David', 'Eva'], n_rows)
}

# create dataframe and sort by date
df = pd.DataFrame(data).sort_values('Date')
df.head(5)

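|   | Date | Make | Model | Color | Year | Price | Mileage | EngineSize | FuelEfficiency | SalesPerson |
|---|------|------|-------|-------|------|-------|---------|------------|----------------|-------------|
| 0 | 2024-01-01 | Mercedes | Sedan | Green | 2023 | 43712.36 | 79357.0 | 1.6 | 37.2 | Charlie |
| 1 | 2024-01-02 | Chevrolet | Hatchback | Red | 2020 | 77173.81 | 3178.0 | 3.0 | 36.9 | Charlie |
| 2 | 2024-01-03 | Audi | Truck | White | 2020 | 78023.16 | 78205.0 | 3.5 | 27.2 | Alice |
| 3 | 2024-01-04 | Nissan | Hatchback | Black | 2022 | 43567.63 | 24007.0 | 3.5 | 33.7 | David |
| 4 | 2024-01-05 | Mercedes | Hatchback | Red | 2019 | 71253.31 | 14431.0 | 4.0 | 26.3 | Bob |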
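
Before handing a DataFrame to an agent, it can help to compute a few reference answers directly with pandas, so the agent's replies can be checked later. The sketch below does this on a small stand-in frame; `sample` and its two columns are illustrative assumptions, and the same three expressions apply unchanged to `df`.

```python
import numpy as np
import pandas as pd

# Small stand-in dataset (illustrative only; not the notebook's `df`)
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "Make": rng.choice(["Toyota", "Honda", "Ford"], 50),
    "Price": rng.uniform(20000, 80000, 50).round(2),
})

# Reference answers for common sanity-check questions
column_names = sample.columns.tolist()          # column names as a plain list
null_count = int(sample.isnull().sum().sum())   # total missing values in the frame
average_price = float(sample["Price"].mean())   # mean of the Price column

print(column_names, null_count, round(average_price, 2))
```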


### Create Data Analysis Agent
Here, we create a Pandas DataFrame agent using LangChain. This agent will be capable of analyzing our dataset and answering questions about it.

In [21]:
# Create the pandas DataFrame agent
agent = create_pandas_dataframe_agent(
    ChatGroq(model_name="llama3-8b-8192", temperature=0),
    df,
    verbose=True,  # print the agent's intermediate reasoning steps
    allow_dangerous_code=True,  # explicit opt-in: the agent executes model-generated Python
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)
print("Data Analysis Agent is ready. You can ask questions about the data.")

Data Analysis Agent is ready. You can ask questions about the data.
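
With `AgentType.ZERO_SHOT_REACT_DESCRIPTION`, the model replies in a plain-text Thought / Action / Action Input format, and the executor parses each reply to decide whether to run a tool or return a final answer. The sketch below illustrates only that parsing step. `parse_react_step` is a hypothetical helper written for this explanation, not LangChain's actual parser.

```python
import re

# Patterns for the two kinds of reply a ReAct-style model can produce
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.*)", re.DOTALL)


def parse_react_step(llm_text: str) -> dict:
    """Classify one model reply as a final answer or a tool call."""
    final = FINAL_RE.search(llm_text)
    if final:
        return {"type": "final", "answer": final.group(1).strip()}
    action = ACTION_RE.search(llm_text)
    if action:
        return {
            "type": "action",
            "tool": action.group(1).strip(),
            "tool_input": action.group(2).strip(),
        }
    raise ValueError("Could not parse model output")


step = parse_react_step(
    "Thought: I need to find the column names in the dataset.\n\n"
    "Action: python_repl_ast\n"
    "Action Input: df.columns"
)
print(step)
```

In a full agent loop, a tool call's result is appended to the prompt as an observation and the model is queried again until it emits a final answer.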


### Define question-asking function

In [22]:
def ask_agent(question):
    """Send a question to the agent and print the question and its answer."""
    # invoke() returns a dict; the agent's final answer is under "output"
    response = agent.invoke({"input": question})["output"]
    print(f"Question: {question}")
    print(f"Answer: {response}")
    print("---")
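
Agent runs can fail, for example when the model produces output the executor cannot parse. As a sketch, the variant below wraps the call in `try`/`except` and returns the answer instead of only printing it. `ask_agent_safe` and `EchoAgent` are hypothetical names. `EchoAgent` is a stand-in so the example runs without an API key, and it assumes the agent exposes `invoke` returning a dict with an `"output"` key, as LangChain's `AgentExecutor` does.

```python
from typing import Any, Optional


class EchoAgent:
    """Stand-in for a real agent (hypothetical, for illustration only)."""

    def invoke(self, inputs: dict) -> dict:
        question = inputs["input"]
        if not question.strip():
            raise ValueError("empty question")
        return {"input": question, "output": f"(stub) You asked: {question}"}


def ask_agent_safe(agent: Any, question: str) -> Optional[str]:
    """Ask the agent a question; return the answer, or None if the run fails."""
    print(f"Question: {question}")
    try:
        result = agent.invoke({"input": question})
    except Exception as exc:  # keep the notebook running on any agent failure
        print(f"Agent failed: {exc}")
        print("---")
        return None
    answer = result["output"]
    print(f"Answer: {answer}")
    print("---")
    return answer


ask_agent_safe(EchoAgent(), "What are the column names in the dataset?")
```

Returning the answer makes it possible to compare the agent's reply against a value computed directly with pandas.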

### Example questions

In [23]:
ask_agent("What are the column names in the dataset?")
ask_agent("How many Null values in the dataframe?")
ask_agent("Using python, what's the average price of cars sold?")



> Entering new AgentExecutor chain...
Question: What are the column names in the dataset?

Thought: I need to find the column names in the dataset.

Action: python_repl_ast
Action Input: df.columns
Index(['Date', 'Make', 'Model', 'Color', 'Year', 'Price', 'Mileage',
       'EngineSize', 'FuelEfficiency', 'SalesPerson'],
      dtype='object')
Thought: The observation shows the column names in the dataset.

Action: python_repl_ast
Action Input: None
Final Answer: The column names in the dataset are: 'Date', 'Make', 'Model', 'Color', 'Year', 'Price', 'Mileage', 'EngineSize', 'FuelEfficiency', 'SalesPerson'.

> Finished chain.
Question: What are the column names in the dataset?
Answer: The column names in the dataset are: 'Date', 'Make', 'Model', 'Color', 'Year', 'Price', 'Mileage', 'EngineSize', 'FuelEfficiency', 'SalesPerson'.
---


> Entering new AgentExecutor chain...
[32;1m[1;