# AI Exploratory Data Analyst 📊

##### 💡 **Research Areas:** Rapid Prototyping, Generative AI, Exploratory Data Analysis, Question-Answering Systems.
<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="300px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
</div>

An AI data analyst that performs **Exploratory Data Analysis** on a dataset by answering natural-language questions with LLMs.

The goal of this prototype is to demonstrate how data analytics tasks can be automated, highlight use cases for PandasAI, and experiment with question-answering models.


## Step 1: Importing Dependencies

- `import os`: Imports the os library, which allows you to interact with the operating system. This is useful for accessing environment variables and file paths.

- `import pandas as pd`: Imports the Pandas library for data manipulation and analysis. We use Pandas to load datasets (CSV files) into DataFrames, which are tabular data structures.

- `from dotenv import load_dotenv`: Imports load_dotenv from the dotenv module. This helps load environment variables from a .env file, ensuring sensitive data like API keys are not hardcoded.

- `from typing import List, Union`: Imports List and Union from typing, which are used for type annotations in Python. They help define the expected types of variables and return values.

- `from pandasai import SmartDataframe, SmartDatalake`: These are classes from the PandasAI library, which allow you to wrap Pandas DataFrames with AI-driven capabilities (such as querying and analyzing the data using LLMs).

- `from pandasai.llm import OpenAI`: Imports PandasAI's `OpenAI` LLM wrapper, which lets PandasAI send natural-language queries to OpenAI models.

- `from pydantic import BaseModel`: Imports BaseModel from Pydantic, which is a data validation library. BaseModel is used to define data structures with built-in validation, ensuring that data conforms to the expected format.

- `from groq import Groq`: Imports the Groq client, which sends requests to the Groq API and returns the model's responses.

- `import instructor`: Imports the `instructor` library, which wraps an LLM client so that responses are parsed and validated against a Pydantic model.
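
Before running the imports above, it can help to confirm that each third-party package is importable. The following is a minimal sketch using only the standard library; `missing_packages` and `REQUIRED_IMPORTS` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
from importlib.util import find_spec
from typing import List


def missing_packages(import_names: List[str]) -> List[str]:
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


# Top-level import names used in this notebook. The pip package name can differ
# from the import name (the `dotenv` module is distributed as `python-dotenv`).
REQUIRED_IMPORTS = ["pandas", "dotenv", "pandasai", "pydantic", "groq", "instructor"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED_IMPORTS)
    print("Missing packages:", missing or "none")
```

An empty result means every listed module can be located; it does not guarantee that the installed versions are compatible with each other.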


## Step 2: Installing Requirements

- `requirements_installed = False`: Initializes a boolean flag requirements_installed to False, indicating whether the required dependencies are installed.

- `max_retries = 3`: Sets the maximum number of retry attempts for installing the dependencies.

- `retries = 0`: Initializes the retries counter to 0.

- `def install_requirements():`: Defines the install_requirements function, which checks and installs the required libraries using pip.

- `if requirements_installed:` Checks whether the requirements have already been installed. If so, it prints a message and exits the function.

- `install_status = os.system("pip install -r requirements.txt")`: Runs the system command to install the dependencies from requirements.txt. The return status indicates whether the installation was successful.

- `if install_status == 0:` If the installation succeeds (install_status == 0), it sets requirements_installed = True and prints a success message.

- `else:` If the installation fails, it retries up to `max_retries` times. Once the retries are exhausted, the program exits with an error.

- `install_requirements()`: Calls the install_requirements() function to start the installation process.
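
The retry flow described above can also be written as a loop instead of recursion over global counters. This is a sketch under stated assumptions, not the notebook's implementation: `run_command` and `install_with_retries` are hypothetical helpers, and the `run` parameter exists only so the logic can be exercised without invoking pip. Calling pip through `sys.executable -m pip` installs into the interpreter that is running the notebook.

```python
import subprocess
import sys
from typing import Callable, List

PIP_COMMAND = [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"]


def run_command(command: List[str]) -> int:
    """Run a command and return its exit code (0 means success)."""
    return subprocess.run(command).returncode


def install_with_retries(
    run: Callable[[List[str]], int] = run_command,
    max_retries: int = 3,
) -> bool:
    """Try the install once, then retry up to `max_retries` more times on failure."""
    for attempt in range(max_retries + 1):
        if run(PIP_COMMAND) == 0:
            print(f"Requirements installed on attempt {attempt + 1}.")
            return True
        print(f"Attempt {attempt + 1} failed.")
    return False
```

Returning `False` instead of calling `exit(1)` leaves the decision about stopping the notebook to the caller.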


In [45]:
import os

requirements_installed = False
max_retries = 3
retries = 0


def install_requirements():
    """Installs the requirements from requirements.txt file"""
    # `retries` is reassigned below, so it must be declared global as well.
    global requirements_installed, retries
    if requirements_installed:
        print("Requirements already installed.")
        return

    print("Installing requirements...")
    install_status = os.system("pip install -r requirements.txt")
    if install_status == 0:
        print("Requirements installed successfully.")
        requirements_installed = True
    else:
        print("Failed to install requirements.")
        if retries < max_retries:
            print("Retrying...")
            retries += 1
            return install_requirements()
        exit(1)
    return
install_requirements()

## Step 3: Setting Up Environment Variables

- `def setup_env()`: This defines the `setup_env()` function, which is responsible for loading environment variables and checking whether the necessary ones are set. This function ensures that all required variables are available before running the application.

- `def check_env(env_var)`: This defines a helper function called check_env to verify if a given environment variable is set. It retrieves the variable's value using os.getenv(env_var).

- `value = os.getenv(env_var)`: Fetches the value of the specified environment variable. If the variable is not set, `os.getenv` returns `None`.

- `if value is None`: If the value is `None` (the environment variable is not set), the function prints a message and terminates the program with exit code 1. This ensures the application doesn't run without the necessary environment variables.

- `load_dotenv(override=True)`: Loads environment variables from a `.env` file using the dotenv library. The `override=True` flag means values in the `.env` file overwrite any existing environment variables with the same name.

- `variables_to_check = ["GROQ_API_KEY", "OPENAI_API_KEY", "PANDASAI_API_KEY"]`: Defines the list of required environment variables, which are checked in the loop that follows.

- `for var in variables_to_check`: This loop iterates through each variable in the variables_to_check list and calls the check_env function to ensure each environment variable is set before proceeding with the application.

- `setup_env()`: This line calls the setup_env() function to start the process of checking and loading the required environment variables. It ensures the application has the correct configuration before it starts.
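
A variation on the check described above is to collect every missing variable first and report them together, rather than stopping at the first one. This is a minimal sketch: `find_missing_variables` and `require_variables` are hypothetical helpers, and unlike the notebook's `check_env`, this version also treats an empty string as missing.

```python
import os
from typing import Iterable, List, Mapping

REQUIRED_VARIABLES = ["GROQ_API_KEY", "OPENAI_API_KEY", "PANDASAI_API_KEY"]


def find_missing_variables(
    names: Iterable[str], environ: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]


def require_variables(
    names: Iterable[str], environ: Mapping[str, str] = os.environ
) -> None:
    """Raise one error listing every missing variable."""
    missing = find_missing_variables(names, environ)
    if missing:
        raise RuntimeError("Please set: " + ", ".join(missing))
```

In the notebook this would be called as `require_variables(REQUIRED_VARIABLES)` after `load_dotenv(override=True)`.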


In [48]:
from dotenv import load_dotenv
import os


def setup_env():
    """Sets up the environment variables"""

    def check_env(env_var):
        value = os.getenv(env_var)
        if value is None:
            print(f"Please set the {env_var} environment variable.")
            exit(1)
        else:
            print(f"{env_var} is set.")

    load_dotenv(override=True)

    variables_to_check = ["GROQ_API_KEY", "OPENAI_API_KEY", "PANDASAI_API_KEY"]

    for var in variables_to_check:
        check_env(var)
        
setup_env()

## Step 4: Groq Client Setup

### Functions

#### 1. get_groq_client()

- `def get_groq_client()`: Initializes the Groq client using the API key provided in the environment variables.

- `groq = Groq(api_key=os.getenv("GROQ_API_KEY"))`: Creates an instance of the `Groq` class with the API key read from the `GROQ_API_KEY` environment variable. If the variable is not set, `os.getenv` returns `None`, which leads to errors, so make sure the key is loaded first.

- `client = instructor.from_groq(groq, mode=instructor.Mode.JSON)`: Wraps the Groq instance with `instructor`. In `JSON` mode the model is asked to reply with JSON, which `instructor` parses and validates against the `response_model` passed to each request.

- `return client`: Returns the wrapped client, which is used to make queries to the Groq API.

#### 2. llm()

##### Function Definition:

- `prompt: str`: The user's input that you want the LLM (large language model) to respond to.

- `response_model`: The Pydantic model class (a subclass of `BaseModel`) that describes the structure of the expected response.

- `system="You are a helpful AI assistant..."`: A default system message that tells the model how to behave: as a helpful assistant that gives detailed and clear responses.

- `model=DEFAULT_MODEL`: The model used to generate responses. `DEFAULT_MODEL` is a constant that must be defined before the function; the docstring names `llama-3.3-70b-versatile` as the default.

- `-> Union[BaseModel, LLMErrorResponse]`: The return type hint. The function returns either an instance of the response model (the normal case) or an `LLMErrorResponse` (if an error occurs).

##### Inside the Function:

- `client = get_groq_client()`: Calls the previously defined `get_groq_client()` function to obtain the client that talks to the Groq API.

- `messages = [{"role": "system", "content": system}, {"role": "user", "content": prompt}]`: Constructs the messages to send to the model:

  - A system message that provides context about the assistant's behavior.
  - A user message, which is the actual query or prompt sent by the user.

- `response = client.chat.completions.create(...)`: Sends the messages to the Groq model and requests a completion. `model` selects which model to use, and `response_model` tells `instructor` which Pydantic model to parse the reply into.

- `return response`: Returns the parsed response, an instance of `response_model`.

##### Error Handling:

- `except Exception as e:` If an exception is raised during the execution (e.g., network error, invalid API key), it will be caught here.

- `traceback.print_exc()`: Prints the traceback of the error, showing where the exception occurred.

- `return LLMErrorResponse(error=str(e))`: Instead of re-raising, the function returns an `LLMErrorResponse` carrying the error message as a string, so callers need to check the type of the result before using it.
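
Because `llm()` returns an error object rather than raising, call sites have to branch on the result type before reading model-specific fields. The sketch below illustrates that pattern in isolation. `Answer`, `ErrorResponse`, `safe_call`, and `unwrap` are hypothetical names, and plain dataclasses stand in for the Pydantic models so that the example runs without third-party packages.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Answer:
    """Stand-in for a Pydantic response model such as the one passed to llm()."""
    response: str


@dataclass
class ErrorResponse:
    """Stand-in for LLMErrorResponse: carries the error message instead of raising."""
    error: str


def safe_call(fn: Callable[[], Answer]) -> Union[Answer, ErrorResponse]:
    """Run fn and convert any exception into an ErrorResponse, as llm() does."""
    try:
        return fn()
    except Exception as exc:
        return ErrorResponse(error=str(exc))


def unwrap(result: Union[Answer, ErrorResponse]) -> str:
    """Branch on the result type before touching model-specific fields."""
    if isinstance(result, ErrorResponse):
        return f"LLM call failed: {result.error}"
    return result.response
```

A truthiness check such as `if not result` would not detect the error case here, because both kinds of object are truthy.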

In [50]:
import os
import traceback
from typing import Type, Union

import instructor
from groq import Groq
from pydantic import BaseModel

# Default Groq model, as named in the llm() docstring.
DEFAULT_MODEL = "llama-3.3-70b-versatile"


class LLMErrorResponse(BaseModel):
    """Returned by llm() instead of raising. Minimal definition assumed from its usage below."""

    error: str


def get_groq_client():
    """Returns an instructor-wrapped Groq client"""
    groq = Groq(api_key=os.getenv("GROQ_API_KEY"))
    client = instructor.from_groq(groq, mode=instructor.Mode.JSON)
    return client


def llm(
    prompt: str,
    response_model: Type[BaseModel],
    system="You are a helpful AI assistant. The user will talk to you and it's your job to provide detailed and clear responses.",
    model=DEFAULT_MODEL,
) -> Union[BaseModel, LLMErrorResponse]:
    """Calls LLM API with the given prompt. Defaults to llama-3.3-70b-versatile"""
    try:
        client = get_groq_client()

        messages = [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ]

        response = client.chat.completions.create(
            messages=messages, model=model, response_model=response_model
        )
        return response
    except Exception as e:
        traceback.print_exc()
        return LLMErrorResponse(error=str(e))
    
setup_env()

## Step 5: Dataset Querying and Analysis System

#### 1. Imports and Libraries :

- `pandas`: Used for data manipulation, particularly to read and work with CSV files.

- `os`: Allows interaction with the operating system (file and directory operations, environment variables).

- `typing.List`: Used for type hinting, specifically to indicate that a parameter is a list of items (in this case, strings).

- `pandasai`: A library providing AI-enhanced functionalities for dataframes (SmartDataframe) and datalakes (SmartDatalake), which allows interacting with data using language models.

- `OpenAI`: Used to interact with OpenAI's language models to analyze and process queries.

---

#### 2. FormattedLLMResponse Class :

- A Pydantic model (it extends `BaseModel`) that defines the structure expected from the language model: a single string field named `response`.

---

#### 3. Dataset Class :

- Represents a dataset made up of multiple tables (CSV files) located in a specified folder.

- It requires a database name and a list of tables (CSV files).

- Upon initialization, it validates its inputs: an empty table list raises a `ValueError` and a missing database directory raises a `FileNotFoundError`. A table whose CSV file is missing is skipped with a printed warning rather than raising.

- Each table is loaded as a pandas DataFrame, and a `SmartDataframe` (an enhanced version of DataFrame) is created for querying the data using AI models.

- A `SmartDatalake` is created, which acts as a container for all the `SmartDataframe` objects, enabling cross-table queries using AI.
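
The validation performed by the `Dataset` constructor in the code cell for this step (an empty table list and a missing directory raise, while a missing CSV file is skipped) can be sketched without pandas or PandasAI. `find_table_files` is a hypothetical helper written for illustration; it only resolves file paths and does not load any data.

```python
from pathlib import Path
from typing import Dict, List


def find_table_files(base_dir: Path, tables: List[str]) -> Dict[str, Path]:
    """Map each table name to its CSV path, skipping tables whose file is missing."""
    if not tables:
        raise ValueError("Tables list cannot be empty.")
    if not base_dir.is_dir():
        raise FileNotFoundError(f"Directory {base_dir} not found.")

    found: Dict[str, Path] = {}
    for table in tables:
        csv_path = base_dir / f"{table}.csv"
        if csv_path.is_file():
            found[table] = csv_path
        else:
            # Mirrors the notebook: a missing table is reported, not fatal.
            print(f"Table file {csv_path} not found. Skipping.")
    return found
```

Each returned path could then be passed to `pd.read_csv` and wrapped in a `SmartDataframe`, as the notebook does.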

---

#### 4. Methods in Dataset :

- `get_table`: This method returns the `SmartDataframe` object for a specified table. If the table does not exist, it raises an exception.

- `get_datalake`: Returns the `SmartDatalake` instance, which contains all the `SmartDataframes` for the dataset.

- `ask`: This method takes a query (string) and a desired response format (e.g., markdown). It first queries the `SmartDatalake` with the provided query. The results are formatted, and the response is passed to the language model for interpretation. It generates a human-readable summary, insights, and recommendations.
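
The prompt-building part of `ask` can be isolated into plain functions, which makes it easier to see that the format instruction is conditional. This is a sketch, not the notebook's code: `summarise_rows` and `build_prompt` are hypothetical helpers, and rows are plain tuples standing in for a DataFrame.

```python
from typing import Sequence

MAX_ROWS = 10  # same cap as the notebook's df.head(10)


def summarise_rows(rows: Sequence[Sequence[object]], max_rows: int = MAX_ROWS) -> str:
    """Render at most max_rows rows as plain text, one row per line."""
    return "\n".join(", ".join(str(cell) for cell in row) for row in rows[:max_rows])


def build_prompt(query: str, result_text: str, fmt: str = "markdown") -> str:
    """Assemble the formatting prompt; the format line is added only for markdown."""
    lines = [
        f"Query: {query}",
        f"Database Result: {result_text}",
        "Instructions:",
        "- Format the data and provide a human readable summary.",
        "- Provide insights and recommendations.",
        "- Don't hallucinate or make up data.",
        "- Provide clear, detailed and concise responses.",
    ]
    if fmt == "markdown":
        lines.append(f"- The 'response' field should be of format = '{fmt}'.")
    return "\n".join(lines)
```

Building the conditional line outside the template avoids embedding a Python expression as literal text inside the prompt string.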

---

#### 5. Subclasses of Dataset :

- `BooksDataset`: A subclass that initializes the `Dataset` class with the "books" database and three tables: "books", "ratings", and "users".

- `StudentPerformanceDataset`: Another subclass for handling student performance data, initialized with the "student_performance" database and the "student_performance" table.

---

#### 6. Functions for Querying :

- `ask_query`: A function that asks a query to a specified dataset. It processes the query using the dataset's `ask` method and formats the result for display.

- `ask_query_on_books`: Specifically handles queries for the `BooksDataset` by calling `ask_query` with the "books" dataset.

- `ask_query_on_student_performance`: Specifically handles queries for the `StudentPerformanceDataset` by calling `ask_query` with the "student_performance" dataset.

---

#### 7. Business Report Generation :

- `BusinessReport` Class: This class is used to define the structure of a business report response, which will be generated from a dataset analysis.

- `DataAnalysisAgent`: This agent is responsible for analyzing a dataset by asking predefined analysis questions (e.g., trends, patterns, recommendations). It then compiles the answers and creates a detailed business report using the language model.

---

#### 8. Data Analysis Functions :

- `data_analysis`: This function uses the `DataAnalysisAgent` to perform an analysis on a given dataset and generates a business report from the analysis results.

- `data_analysis_on_books`: A function that runs the analysis on the `BooksDataset` and generates a report.

- `data_analysis_on_student_performance`: A function that runs the analysis on the `StudentPerformanceDataset` and generates a report.

---

### Key Concepts :

- **SmartDataframe** and **SmartDatalake**: These are enhanced pandas objects that enable querying data using AI models (like OpenAI's GPT models).

- **Query Handling**: Queries are sent to the dataset via the `ask` method, which returns a formatted response with insights, recommendations, or answers based on the data.

- **Data Analysis and Business Reporting**: The `DataAnalysisAgent` class performs data analysis by asking a set of predefined questions. It then uses the language model to generate a professional business report summarizing the findings.

---

### Flow :

- **Initialization**: The dataset is loaded from CSV files into memory. The `SmartDataframe` objects are created, and a `SmartDatalake` is set up for AI-based querying.

- **Querying**: Users can ask specific queries (e.g., "What are the top 10 books with the highest average rating?"), which are answered using the dataset and AI.

- **Data Analysis**: The `DataAnalysisAgent` performs deeper analysis by asking predefined questions about the dataset and generating a business report using the AI model.


In [52]:
import pandas as pd
import os
from typing import List
from pandasai import SmartDataframe, SmartDatalake
from pandasai.llm import OpenAI


class FormattedLLMResponse(BaseModel):
    response: str


class Dataset:
    """Dataset class which contains tables corresponding to CSV files in the base folder"""

    def __init__(self, database: str, tables: List[str]):
        llm = OpenAI(api_token=os.getenv("OPENAI_API_KEY"))
        if len(tables) == 0:
            raise ValueError("Tables list cannot be empty.")

        if not os.path.isdir(f"data/{database}"):
            raise FileNotFoundError(f"Directory {database} not found.")

        self.base_path = os.path.join(os.getcwd() + "/data", database)
        self.tables = {}
        self.dataframes = {}

        for table in tables:
            table_file_path = os.path.join(self.base_path, f"{table}.csv")
            if not os.path.isfile(table_file_path):
                print(f"Table file {table_file_path} not found. Skipping. ")
                continue
            df = pd.read_csv(table_file_path)
            self.tables[table] = SmartDataframe(df)
            self.dataframes[table] = pd.read_csv(table_file_path)

        print(f"Loaded {len(self.tables)} tables from {database} dataset.")

        self.datalake = SmartDatalake(
            [self.tables[key] for key in self.tables.keys()], config={"llm": llm}
        )

    def get_table(self, table_name: str) -> SmartDataframe:
        table = self.tables.get(table_name)
        if table is None:
            raise ValueError(f"Table {table_name} not found.")
        return table

    def get_datalake(self) -> SmartDatalake:
        return self.datalake

    def ask(self, query: str, format="markdown") -> str:
        print("Asking...")
        datalake = self.get_datalake()
        df = datalake.chat(query)
        print("Asked.")
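        # Normalise the result to text; cap DataFrames at 10 rows to keep the prompt small.
        if isinstance(df, pd.DataFrame):
            df_string = str(df.head(10))
        else:
            df_string = str(df)
        # Only add the format instruction when markdown output is requested.
        format_instruction = (
            f"- The 'response' field should be of format = '{format}'"
            if format == "markdown"
            else ""
        )
        prompt = f"""
            Query: {query}
            Database Result: {df_string}
            Instructions:
            - Format the data and provide a human readable summary.
            - Provide insights and recommendations.
            - Don't hallucinate or make up data.
            - Provide clear, detailed and concise responses.
            {format_instruction}
        """
        llm_response = llm(prompt, response_model=FormattedLLMResponse)
        # llm() returns an error object on failure, which has no `response` field.
        if not isinstance(llm_response, FormattedLLMResponse):
            print("Failed to get a formatted response from the LLM.")
            return df_string
        return llm_response.response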


class BooksDataset(Dataset):
    def __init__(self):
        super().__init__("books", ["books", "ratings", "users"])


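class StudentPerformanceDataset(Dataset):
    def __init__(self):
        super().__init__("student_performance", ["student_performance"])


# Alias for the original misspelled name, which later cells still use.
StudentPeformanceDataset = StudentPerformanceDataset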

## Step 6: Render Markdown Output

#### Function Overview:

The `render_output` function wraps raw Markdown text in an object that Jupyter renders as formatted output (headers, lists, code formatting, and so on), making the result easier to read.

---

### Explanation:

1. `from IPython.display import Markdown`:  
This imports the Markdown class from the IPython.display module, which allows you to render Markdown-formatted text in IPython environments like Jupyter notebooks.  

2. Function definition (`render_output`):  

    - Input: The function takes a single argument, `markdown`, which is expected to be a string in Markdown format.  

    - Output: The function returns a `Markdown` object initialized with the input string. Jupyter renders this object when it is the last expression in a cell or is passed to `display`.  

3. `return Markdown(markdown)`:  
This line creates a `Markdown` object from the provided text and returns it so it can be displayed in a notebook.


In [53]:
from IPython.display import Markdown


def render_output(markdown: str) -> Markdown:
    """Wraps the generated markdown text so that Jupyter renders it."""
    return Markdown(markdown)

## Step 7: Querying Dataset and Displaying the Result

### Purpose:

The `ask_query` function is useful for querying a dataset interactively in a Jupyter notebook. It ensures a clean and readable output by clearing previous outputs and formatting the query and its result properly.

---

### Explanation:

1. `from IPython.display import clear_output`:  

This imports the `clear_output` function from the IPython.display module, which allows you to clear the output area of a Jupyter notebook. It's often used to refresh the display before showing new content.  

2. Function definition (`ask_query`):  

- Input:  
    - `query`: A string representing the query you want to ask the dataset.  
    - `dataset`: An instance of the `Dataset` class (or its subclass). This dataset contains the data that will be queried.  

- Output:  
    - The `Markdown` object produced by `render_output`, which Jupyter renders when it is the last expression in the cell.  

3. `response = dataset.ask(query)`:  

This calls the ask method of the provided dataset and passes the query. The result of the query is stored in response.  

4. `clear_output()`:  

This clears the notebook's output area to ensure that previous outputs do not clutter the screen before displaying the new response.  

5. `response = f"### Query: {query}\n{response}"`:  

This formats the response by adding a heading (`### Query: {query}`) and the actual response from the query. The query is displayed above the result for clarity.  

6. `return render_output(response)`:  

The formatted response is passed to the `render_output` function, which is responsible for rendering the Markdown-formatted text for display in the notebook.

In [54]:
from IPython.display import Markdown, clear_output


def ask_query(query: str, dataset: Dataset) -> Markdown:
    """Asks the given query to the dataset and returns the rendered response."""
    response = dataset.ask(query)
    clear_output()
    response = f"### Query: {query}\n{response}"
    return render_output(response)

## Step 8: Query Execution on Books and Student Performance

`ask_query_on_books` & `ask_query_on_student_performance`:

- Convenience functions that create the respective dataset object and pass it, together with the query, to the `ask_query` function.
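
As written, each of these functions builds a new dataset object on every call, which means the CSV files are read again for each query. One optional way to avoid that is to cache the instance. The sketch below shows the idea with `functools.lru_cache`; `FakeDataset`, `get_dataset`, and `ask_query_cached` are hypothetical stand-ins, not names from the notebook.

```python
from functools import lru_cache


class FakeDataset:
    """Hypothetical stand-in for BooksDataset; counts how often it is constructed."""

    instances_created = 0

    def __init__(self) -> None:
        FakeDataset.instances_created += 1

    def ask(self, query: str) -> str:
        return f"answer to: {query}"


@lru_cache(maxsize=None)
def get_dataset() -> FakeDataset:
    """Build the dataset on first use and return the same instance afterwards."""
    return FakeDataset()


def ask_query_cached(query: str) -> str:
    """Answer a query using the shared dataset instance."""
    return get_dataset().ask(query)
```

The trade-off is that a cached dataset does not pick up changes to the CSV files until `get_dataset.cache_clear()` is called.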

In [55]:
def ask_query_on_books(query: str) -> Markdown:
    """Runs the query against the books dataset."""
    books_dataset = BooksDataset()
    return ask_query(query, books_dataset)


def ask_query_on_student_performance(query: str) -> Markdown:
    """Runs the query against the student performance dataset."""
    student_performance_dataset = StudentPeformanceDataset()
    return ask_query(query, student_performance_dataset)

## Step 9: Running Queries
- Sample Queries:

    - Queries for top 10 books with highest average ratings and top 10 students with highest average grades.

In [None]:
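from IPython.display import display

# Each ask_query call clears the cell output and a cell only renders its last
# expression, so collect both results first and display them together.
books_query = "What are the top 10 books with the highest average rating?"
books_answer = ask_query_on_books(books_query)

students_query = "What are the top 10 students with the highest average grades?"
students_answer = ask_query_on_student_performance(students_query)

display(books_answer, students_answer)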

## Step 10: DataAnalysisAgent Class

- Handles data analysis by asking a series of predefined questions and aggregating the answers into a business report.
- `analyse` method: Asks predefined questions, compiles answers, and generates a report using LLM.

- `get_analysis` method: Returns the analysis, either generated or cached.
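
The control flow of the agent (ask each question, skip failures, cache the combined result) can be sketched without any LLM calls. `AnalysisRunner` is a hypothetical, simplified analogue of `DataAnalysisAgent`; the `ask` callable stands in for `dataset.ask`, and the final report-generation step is left out.

```python
from typing import Callable, List, Optional


class AnalysisRunner:
    """Hypothetical, simplified analogue of DataAnalysisAgent without any LLM calls."""

    def __init__(self, ask: Callable[[str], str], questions: List[str]):
        self.ask = ask
        self.questions = questions
        self.analysis: Optional[str] = None  # None means "not computed yet"

    def analyse(self) -> str:
        sections = []
        for question in self.questions:
            try:
                sections.append(f"### {question}\n{self.ask(question)}")
            except Exception as exc:
                # A failed question is reported and skipped, not fatal.
                print(f"Failed to ask question: {question} ({exc})")
        self.analysis = "\n".join(sections)
        return self.analysis

    def get_analysis(self) -> str:
        # Reuse the cached result when analyse() has already run.
        if self.analysis is None:
            return self.analyse()
        return self.analysis
```

Using `None` as the "not computed" marker keeps an empty analysis (every question failed) distinguishable from one that has not run yet.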

In [58]:
from pydantic import BaseModel


class BusinessReport(BaseModel):
    markdown: str


class DataAnalysisAgent:
    def __init__(self, dataset: Dataset):
        self.dataset = dataset
        self.analysis = None

    def analyse(self) -> str:
        # instruction = "Respond in a human readable format. Don't provide graphs or execute code."
        analysis_questions = [
            "What are the key insights from the dataset?",
            "What are the key trends in the dataset?",
            "What are the key patterns in the dataset?",
            "What are the key anomalies in the dataset?",
            "What are the key recommendations based on the dataset?",
        ]
        self.analysis = ""
        for question in analysis_questions:
            try:
                answer = self.dataset.ask(question)
                self.analysis += f"\n### 👉🏽 {question}\n💡 {answer}"
            except Exception as e:
                # Report the failure and move on to the next question.
                print(f"Failed to ask question: {question} ({e})")
                continue
        print("Preliminary analysis completed. Generating business report...")
        prompt = f"""
            Understand the provided analysis for the dataset and provide a neatly formatted report.
            This report should be professional, detailed and ideal for business stakeholders.
            The report should include all facts in the preliminary analysis and provide additional insights.
            The 'markdown' field should contain the report formatted as markdown.

            Preliminary Analysis: {self.analysis}
        """
        llm_response = llm(prompt=prompt, response_model=BusinessReport)
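        # llm() returns an error object (which is truthy) on failure, so check the type.
        if not isinstance(llm_response, BusinessReport):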
            print("Failed to get response from LLM.")
            return self.analysis
        self.analysis = llm_response.markdown
        print("Business report generated.")
        return self.analysis

    def get_analysis(self) -> str:
        if self.analysis is None:
            return self.analyse()
        return self.analysis

## Step 11: Data Analysis Execution
- `data_analysis_on_books` & `data_analysis_on_student_performance`:

    - Calls the DataAnalysisAgent on the respective datasets (Books and Student Performance) and renders the analysis.

In [59]:
from IPython.display import clear_output, Markdown


def data_analysis(dataset: Dataset) -> Markdown:
    agent = DataAnalysisAgent(dataset)
    analysis = agent.analyse()
    clear_output()
    return render_output(analysis)


def data_analysis_on_books() -> Markdown:
    books_dataset = BooksDataset()
    return data_analysis(books_dataset)


def data_analysis_on_student_performance() -> Markdown:
    student_performance_dataset = StudentPeformanceDataset()
    return data_analysis(student_performance_dataset)

## Step 12: Triggering the Data Analysis

- Calls data analysis on both the student performance and books datasets.

In [None]:
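from IPython.display import display

# Each analysis clears the cell output and a cell only renders its last
# expression, so collect both reports first and display them together.
student_report = data_analysis_on_student_performance()
books_report = data_analysis_on_books()
display(student_report, books_report)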

---

# Thank You for visiting The Hackers Playbook! 🌐

If you liked this research material:

- [Subscribe to our newsletter.](https://thehackersplaybook.substack.com)

- [Follow us on LinkedIn.](https://www.linkedin.com/company/the-hackers-playbook/)

- [Leave a star on our GitHub.](https://www.github.com/thehackersplaybook)

<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
</div>
