In [6]:
from csvdiffgpt import summarize
from csvdiffgpt import compare
from dotenv import load_dotenv
import os

In [7]:
load_dotenv()  # read variables from a local .env file into the environment, if one exists
API_KEY = os.getenv("GEMINI_API_KEY")
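
`os.getenv` returns `None` when the variable is not set, so a missing key would only show up later, inside the provider call. A minimal sketch of failing early instead; `require_env` is a hypothetical helper for illustration and not part of csvdiffgpt:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable (None) and an empty string.
        raise RuntimeError(f"{name} is not set; export it or add it to your .env file")
    return value


# Possible usage with the variable name from this notebook:
# API_KEY = require_env("GEMINI_API_KEY")
```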

In [16]:
result = compare(
    file1="./baseball.csv",
    file2="./baseball2.csv",
    question="What changed between these versions?",
    api_key=API_KEY,
    provider="gemini",
    model="gemini-2.0-flash"
)
print(result)

Okay, let's analyze the differences between the two baseball datasets.

**1. Overview**

Both datasets contain baseball player statistics. The first dataset has 771 rows and 16 columns, while the second has 809 rows and 17 columns. The key difference is the addition of an 'AVG' column in the second dataset and an increase in the number of rows.

**2. Structure Changes**

*   **Added Column:** The second dataset includes a new column named 'AVG' (Batting Average), which is of float64 type.
*   **Row Count:** The second dataset has 38 more rows than the first (809 vs. 771), representing a 4.93% increase.

**3. Content Changes**

*   The values in the common columns ('G', 'Age', 'BB', 'AB', 'H', 'RBI', '3B', '2B', 'CS', 'R', 'SB', 'SO', 'HR', 'PA') have not changed.

**4. Statistical Changes**

*   The mean of column 'G' changed from 66.2 to 66.44.
*   The mean of column 'PA' changed from 208.64 to 209.83.
*   The mean of column 'AB' changed from 185.61 to 186.7.
*   The mean of column 'R
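
The row-count figures in the LLM answer above can be checked by hand. A small sketch of that arithmetic, using only the two row counts it reports (771 and 809); `percent_change` is a hypothetical helper name:

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


rows_before, rows_after = 771, 809  # row counts reported in the output above
added_rows = rows_after - rows_before
increase_pct = round(percent_change(rows_before, rows_after), 2)

print(added_rows, increase_pct)  # 38 4.93
```

Both numbers agree with the "38 more rows" and "4.93% increase" stated in the answer.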

In [9]:
# Use compare() without an LLM: returns the raw comparison data as a dict
comparison_data = compare(
    file1="./baseball.csv",
    file2="./baseball2.csv",
    use_llm=False
)
print(comparison_data)

{'file1': {'path': './baseball.csv', 'row_count': 771, 'column_count': 16, 'metadata': {'file_path': './baseball.csv', 'file_size_mb': 0.04, 'separator': ',', 'total_rows': 771, 'total_columns': 16, 'analyzed_rows': 771, 'analyzed_columns': 16, 'columns': {'Last': {'type': 'object', 'nulls': 0, 'null_percentage': np.float64(0.0), 'unique_count': 657, 'min_length': 3, 'max_length': 14, 'avg_length': 6.52, 'examples': ['Lansford', 'Lowry', 'Beckwith', 'Mullins', 'Valle']}, 'First': {'type': 'object', 'nulls': 0, 'null_percentage': np.float64(0.0), 'unique_count': 273, 'min_length': 1, 'max_length': 9, 'avg_length': 4.35, 'examples': ['Keith', 'Jay', 'Rafael', 'Tom', 'Rick']}, 'Age': {'type': 'int64', 'nulls': 0, 'null_percentage': np.float64(0.0), 'unique_count': 24, 'min': np.int64(20), 'max': np.int64(45), 'mean': 27.98, 'median': 27.0, 'std': 4.37, 'examples': [22, 32, 26, 21, 37]}, 'G': {'type': 'int64', 'nulls': 0, 'null_percentage': np.float64(0.0), 'unique_count': 160, 'min': np.i
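
The printed dict is hard to read in one line. A minimal sketch of pulling out specific fields, assuming only the keys visible in the truncated output above (`file1`, `row_count`, `column_count`, `metadata`, `columns`, `type`). The `sample` dict is a hand-copied subset of that output, and `column_types` is a hypothetical helper:

```python
# Hand-copied subset of the output above; the real dict has more keys and columns.
sample = {
    "file1": {
        "path": "./baseball.csv",
        "row_count": 771,
        "column_count": 16,
        "metadata": {
            "columns": {
                "Last": {"type": "object", "nulls": 0, "unique_count": 657},
                "First": {"type": "object", "nulls": 0, "unique_count": 273},
                "Age": {"type": "int64", "nulls": 0, "unique_count": 24},
            }
        },
    }
}


def column_types(file_info: dict) -> dict:
    """Map each column name to its reported dtype string."""
    columns = file_info["metadata"]["columns"]
    return {name: info["type"] for name, info in columns.items()}


info = sample["file1"]
print(info["path"], info["row_count"], info["column_count"])
print(column_types(info))
```

With the real result, `comparison_data["file1"]` would take the place of `sample["file1"]`.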

In [17]:
result = summarize(
    file="./baseball.csv",
    question="What insights can you provide about this dataset?",
    api_key=API_KEY,
    provider="gemini",
    model="gemini-2.0-flash"
)
print(result)

Okay, I will analyze the provided baseball dataset metadata and provide a summary.

**1. Overview**

This dataset contains baseball player statistics, with 771 rows (representing individual players) and 16 columns (representing various statistics and personal information). The file size is relatively small at 0.04 MB. The data includes batting statistics such as At Bats (AB), Runs (R), Hits (H), Home Runs (HR), and Strikeouts (SO), as well as personal information like name and age.

**2. Key Variables**

*   **Last & First:** Player's last and first names. These are object (string) types with no missing values. The last names have a higher unique count (657) than first names (273), as expected.
*   **Age:** Player's age, ranging from 20 to 45 years old, with an average age of approximately 28.
*   **G:** Games played, ranging from 1 to 163.
*   **PA:** Plate Appearances, ranging from 0 to 742.
*   **AB:** At Bats, ranging from 0 to 687.
*   **R:** Runs scored, ranging from 0 to 130.
* 

In [11]:
metadata = summarize(
    file="./baseball.csv",
    use_llm=False
)
print(metadata)

{'file_path': './baseball.csv', 'file_size_mb': 0.04, 'separator': ',', 'total_rows': 771, 'total_columns': 16, 'analyzed_rows': 771, 'analyzed_columns': 16, 'columns': {'Last': {'type': 'object', 'nulls': 0, 'null_percentage': np.float64(0.0), 'unique_count': 657, 'min_length': 3, 'max_length': 14, 'avg_length': 6.52, 'examples': ['Griffin', 'Booker', 'McCullers', 'Barfield', 'Reuschel']}, 'First': {'type': 'object', 'nulls': 0, 'null_percentage': np.float64(0.0), 'unique_count': 273, 'min_length': 1, 'max_length': 9, 'avg_length': 4.35, 'examples': ['Tony', 'Juan', 'Floyd', 'Paul', 'Luis']}, 'Age': {'type': 'int64', 'nulls': 0, 'null_percentage': np.float64(0.0), 'unique_count': 24, 'min': np.int64(20), 'max': np.int64(45), 'mean': 27.98, 'median': 27.0, 'std': 4.37, 'examples': [26, 33, 31, 31, 39]}, 'G': {'type': 'int64', 'nulls': 0, 'null_percentage': np.float64(0.0), 'unique_count': 160, 'min': np.int64(1), 'max': np.int64(163), 'mean': 66.2, 'median': 56.0, 'std': 52.16, 'exampl
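
The metadata contains NumPy scalars such as `np.int64(20)`, which the standard `json` module cannot serialize directly. A minimal sketch of saving the dict as JSON by unwrapping such values through their `.item()` method; `_to_builtin` and `metadata_to_json` are hypothetical helpers, not part of csvdiffgpt:

```python
import json


def _to_builtin(obj):
    """Fallback for json.dumps: unwrap scalars that expose .item(), as NumPy scalars do."""
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def metadata_to_json(metadata: dict) -> str:
    """Serialize a metadata dict to an indented JSON string."""
    return json.dumps(metadata, default=_to_builtin, indent=2)


# Possible usage with the dict returned above:
# print(metadata_to_json(metadata))
```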