### AI-Powered Simple Data Analysis Agent


This notebook builds a lightweight **AI-driven analysis assistant** that lets users query a dataset in plain language. Instead of writing code or statistical formulas, users ask questions in natural language and receive direct answers.

#### *Rationale*
Working with data often requires technical skills in programming or analytics. This project demonstrates how an agent can remove that barrier by translating everyday questions into structured operations, making data exploration faster and more approachable.

#### *Design Overview*
- **Demo Dataset**: A simulated car sales dataset is used as the working example.  
- **Natural Language Interface**: A language model (`gpt-4o`, accessed through `ChatOpenAI`) interprets user intent.  
- **Analysis Engine**: Data operations are executed as pandas code against an in-memory DataFrame.  
- **Agent Layer**: LangChain's `create_pandas_dataframe_agent` coordinates between the model and the analysis tools to deliver results.  

#### *Advantages*
- **Lower entry barrier** – accessible to users without technical expertise  
- **On-demand exploration** – quick answers to diverse questions  
- **Adaptability** – applicable to different datasets and domains  

#### *Closing Note*
This agent illustrates how AI can serve as a bridge between raw data and decision-making, providing a more natural and inclusive way to interact with information.


#### Importing Libraries

In [None]:

import os
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
from dotenv import load_dotenv
from langchain.agents import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_openai import ChatOpenAI

# Load variables from a local .env file; ChatOpenAI reads OPENAI_API_KEY from the environment
load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your environment or .env file.")

# Set a random seed for reproducibility
np.random.seed(42)

#### Generating Sample Data 

In [None]:
# Generate sample data
n_rows = 1000

# Generate dates
start_date = datetime(2022, 1, 1)
dates = [start_date + timedelta(days=i) for i in range(n_rows)]

# Define data categories
makes = ['Toyota', 'Honda', 'Ford', 'Chevrolet', 'Nissan', 'BMW', 'Mercedes', 'Audi', 'Hyundai', 'Kia']
models = ['Sedan', 'SUV', 'Truck', 'Hatchback', 'Coupe', 'Van']
colors = ['Red', 'Blue', 'Black', 'White', 'Silver', 'Gray', 'Green']

# Create the dataset
data = {
    'Date': dates,
    'Make': np.random.choice(makes, n_rows),
    'Model': np.random.choice(models, n_rows),
    'Color': np.random.choice(colors, n_rows),
    'Year': np.random.randint(2015, 2023, n_rows),
    'Price': np.random.uniform(20000, 80000, n_rows).round(2),
    'Mileage': np.random.uniform(0, 100000, n_rows).round(0),
    'EngineSize': np.random.choice([1.6, 2.0, 2.5, 3.0, 3.5, 4.0], n_rows),
    'FuelEfficiency': np.random.uniform(20, 40, n_rows).round(1),
    'SalesPerson': np.random.choice(['Alice', 'Bob', 'Charlie', 'David', 'Eva'], n_rows)
}

# Create DataFrame and sort by date
df = pd.DataFrame(data).sort_values('Date')

# Display sample data and statistics
print("\nFirst few rows of the generated data:")
print(df.head())

print("\nDataFrame info:")
df.info()

print("\nSummary statistics:")
print(df.describe())

#### Creating the AI-Powered Simple Data Analysis Agent 

In [None]:
# Create the Pandas DataFrame agent
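agent = create_pandas_dataframe_agent(
    ChatOpenAI(model="gpt-4o", temperature=0),  # temperature=0 for more repeatable answers
    df,
    verbose=True,  # print the agent's intermediate steps
    # Opt in to running model-generated Python; use only in a trusted or sandboxed environment
    allow_dangerous_code=True,
    agent_type=AgentType.OPENAI_FUNCTIONS,
)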
print("Data Analysis Agent is ready. You can now ask questions about the data.")

#### QA Function

In [None]:
def ask_agent(question):
    """Send a question to the agent and print its answer."""
    # The agent manages its own scratchpad, so only the question is passed in
    response = agent.invoke({"input": question})
    print(f"Question: {question}")
    print(f"Answer: {response['output']}")
    print("---")
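
Agent calls can fail for reasons outside the notebook's control, such as network errors, rate limits, or generated code that raises. A minimal sketch of a defensive wrapper follows. The names `ask_agent_safely` and `StubAgent` are hypothetical and used only for illustration; the one assumption about the real agent is that it exposes `invoke` and returns a dict with an `"output"` key.

```python
from typing import Any


def ask_agent_safely(agent: Any, question: str) -> str:
    """Ask the agent a question and return the answer, or a readable error message."""
    try:
        result = agent.invoke({"input": question})
    except Exception as exc:  # broad on purpose: any failure becomes a message
        return f"Agent error ({type(exc).__name__}): {exc}"
    # AgentExecutor-style results are dicts carrying the answer under "output"
    if isinstance(result, dict) and "output" in result:
        return str(result["output"])
    return str(result)


class StubAgent:
    """Hypothetical stand-in so the wrapper can be tried without an API key."""

    def invoke(self, inputs: dict) -> dict:
        if not inputs["input"].strip():
            raise ValueError("empty question")
        return {"input": inputs["input"], "output": "stubbed answer"}


stub = StubAgent()
print(ask_agent_safely(stub, "How many rows are in this dataset?"))
print(ask_agent_safely(stub, "   "))
```

Returning a message instead of raising keeps a batch of questions running when one of them fails; whether that is preferable to stopping early depends on how the results are used.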

#### Testing

In [None]:

# Example questions
ask_agent("What are the column names in this dataset?")
ask_agent("How many rows are in this dataset?")
ask_agent("What is the average price of cars sold?")
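
Because a language model produces the answers, it helps to check them against direct pandas calls. The sketch below computes reference values for the three example questions. `reference_answers` is a hypothetical helper, and the four-row frame is a hand-made stand-in for the notebook's `df` (only the column names are borrowed from it).

```python
import pandas as pd


def reference_answers(frame: pd.DataFrame) -> dict:
    """Compute ground-truth answers for the three example questions."""
    return {
        "columns": frame.columns.tolist(),
        "n_rows": len(frame),
        "average_price": float(frame["Price"].mean()),
    }


# Tiny stand-in frame; in the notebook, pass the generated `df` instead
sample = pd.DataFrame(
    {
        "Make": ["Toyota", "Honda", "Ford", "BMW"],
        "Model": ["Sedan", "SUV", "Truck", "Coupe"],
        "Price": [20000.0, 30000.0, 40000.0, 50000.0],
    }
)

expected = reference_answers(sample)
print(expected)
```

Calling `reference_answers(df)` on the generated dataset gives values to compare with what the agent reports.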

#### Summary

This prototype shows how a language model can be combined with a dataset interface to answer natural language questions about structured data.  
It works well as a teaching example, and the following improvements would bring it closer to production-grade:

- ***Dataset Expansion***  
  Move beyond a toy synthetic dataset (car sales) and connect to **real-world data sources** (CSV files, SQL databases, or APIs). This will make the agent applicable in practical business scenarios.  

- ***Enhanced Query Parsing***  
  Instead of relying only on the language model’s interpretation, integrate a **query translation layer** that converts natural language into **SQL or Pandas expressions** for more accurate and auditable operations.  

- ***Advanced Analysis Tools***  
  Extend capabilities to cover **visualizations (Matplotlib/Plotly)**, **statistical modeling (Scikit-learn, Statsmodels)**, or even **forecasting (Prophet, ARIMA)** to provide richer insights.  

- ***Error Handling & User Guidance***  
  Implement structured strategies for **ambiguous queries**, such as returning clarifying questions or offering suggested interpretations when the model is uncertain.  

- ***Explainability***  
  Add a mechanism to show the **underlying data operations** (e.g., the Pandas or SQL query executed) alongside the answer, improving transparency and trustworthiness.  

- ***Scalability***  
  Replace in-memory processing with **DuckDB, PostgreSQL, or BigQuery** connectors to handle larger datasets efficiently.  

- ***External Knowledge Augmentation***  
  Allow the agent to enrich answers with **external context** (e.g., industry benchmarks, documentation, or real-time APIs), making responses more insightful.  

- ***UI / Accessibility***  
  Wrap the agent in a lightweight interface (e.g., **Gradio, Streamlit, or Jupyter widgets**) so that non-technical users can interact without code.  
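
Several of these ideas (auditable query translation, showing the executed operation, and moving computation into a database engine) can be sketched together. The example below is only an illustration under stated assumptions: it uses the standard-library `sqlite3` module as a stand-in for DuckDB, PostgreSQL, or BigQuery; the `QUERY_TEMPLATES` mapping, the `run_audited_query` helper, and the three-row `car_sales` table are hypothetical; and the step where a language model would choose a template is replaced by a fixed key.

```python
import sqlite3

# Hypothetical whitelist of vetted queries; a language model would only
# choose a key from this mapping instead of writing free-form code.
QUERY_TEMPLATES = {
    "row_count": "SELECT COUNT(*) FROM car_sales",
    "average_price": "SELECT AVG(Price) FROM car_sales",
    "average_price_by_make": (
        "SELECT Make, AVG(Price) FROM car_sales GROUP BY Make ORDER BY Make"
    ),
}


def run_audited_query(conn: sqlite3.Connection, intent: str) -> dict:
    """Run a whitelisted query and return the rows together with the SQL used."""
    if intent not in QUERY_TEMPLATES:
        # Unknown intents are rejected so the caller can ask for clarification
        raise KeyError(f"Unsupported intent: {intent!r}")
    sql = QUERY_TEMPLATES[intent]
    rows = conn.execute(sql).fetchall()
    return {"intent": intent, "sql": sql, "rows": rows}


# Small in-memory table standing in for a real sales database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car_sales (Make TEXT, Model TEXT, Price REAL)")
conn.executemany(
    "INSERT INTO car_sales VALUES (?, ?, ?)",
    [
        ("Toyota", "Sedan", 20000.0),
        ("Toyota", "SUV", 30000.0),
        ("Honda", "Truck", 40000.0),
    ],
)

result = run_audited_query(conn, "average_price_by_make")
print(result["sql"])
print(result["rows"])
```

Returning the SQL next to the rows lets a reader verify exactly what was computed, and restricting the model to a fixed set of templates trades flexibility for predictability.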

---

With these enhancements, the current demo could evolve from a simple proof-of-concept into a **multi-functional data analysis assistant** that is scalable, explainable, and practical for real-world adoption.
