😎 Cocoon provides LLM agents that organize the raw data in your data warehouse so it is ready for analysis.
It is currently a lightweight Python library that connects to your database (e.g., Snowflake, DuckDB).
The agent interactively helps with various tasks for data cleaning, data integration, data modeling, and more.
Profiling is the first step to understanding the table and identifying any anomalies.
Many small decisions require semantic understanding by LLMs. For example, an age of 100 is acceptable, but -1 is impossible!
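To make the age example concrete, here is a minimal sketch of the kind of check a profile might surface. It is not Cocoon's implementation: the `RangeRule` class, the `find_anomalies` helper, and the 0 to 120 bounds are all hypothetical and hand-written here, whereas the point of Cocoon is that an LLM proposes such rules from the column's meaning.

```python
from dataclasses import dataclass


@dataclass
class RangeRule:
    """Hypothetical rule: values in `column` should lie in [min_value, max_value]."""
    column: str
    min_value: float
    max_value: float


def find_anomalies(rows, rule):
    """Return (row_index, value) pairs whose value falls outside the rule's range."""
    anomalies = []
    for i, row in enumerate(rows):
        value = row.get(rule.column)
        if value is None:
            # Missing values are a separate concern; skip them here.
            continue
        if not (rule.min_value <= value <= rule.max_value):
            anomalies.append((i, value))
    return anomalies


rows = [{"age": 34}, {"age": 100}, {"age": -1}, {"age": None}]
# Assumed bounds for a human age; in practice they depend on what the column means.
age_rule = RangeRule(column="age", min_value=0, max_value=120)
anomalies = find_anomalies(rows, age_rule)
print(anomalies)  # [(2, -1)]
```

An age of 100 passes and -1 is flagged, but only because someone knew that the column holds human ages. Choosing sensible bounds per column is the semantic decision the text refers to.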
- 👉 Online Service: Drop your CSV, and the profile will be ready in under 10 minutes
- 👉 Python Package: Check out the notebook to interactively profile your table in Python
- (Both run the same code. The Python package requires access to an LLM API, but it is interactive and has no limit on table size or number of columns.)
Check out more profiles
Dataset Title | Profile Link |
---|---|
AQI and Latitude/Longitude of Countries | View Profile |
2020 Property Sales Data | View Profile |
AAC Shelter Cat Outcome | View Profile |
Books | View Profile |
Cancer | View Profile |
Divorces 2000-2015 | View Profile |
German Credit Data | View Profile |
K-Drama | View Profile |
Patients | View Profile |
Used Car Data | View Profile |
Cite Cocoon Profile
@article{huang2024cocoon,
title={Cocoon: Semantic Table Profiling Using Large Language Models},
author={Huang, Zezhou and Wu, Eugene},
journal={arXiv preprint arXiv:2404.12552},
year={2024}
}
- 👉 Python Package: Check out the notebook that cleans tables in Snowflake/DuckDB
- 👉 Check out the 1 min demo
Screenshot: LLMs help you interactively cast columns and fix cases. The output is dbt staging SQL and YAML.
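As a rough illustration of what a staging-style query that casts columns and normalizes text might look like, the sketch below runs one against an in-memory SQLite database. Everything in it is an assumption made for the example: SQLite stands in for Snowflake/DuckDB, the `raw_pets` table and its columns are invented, "fix cases" is read as normalizing letter case, and the SQL is hand-written rather than generated by Cocoon.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real warehouse connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_pets (name TEXT, age TEXT, outcome TEXT)")
conn.executemany(
    "INSERT INTO raw_pets VALUES (?, ?, ?)",
    [("  Luna ", "3", "ADOPTED"), ("milo", "7", "Adopted"), ("BELLA", "2", "transfer")],
)

# Hypothetical staging query: trim whitespace, cast age to an integer,
# and lower-case the categorical outcome column.
staging_sql = """
SELECT
    TRIM(name)           AS name,
    CAST(age AS INTEGER) AS age,
    LOWER(TRIM(outcome)) AS outcome
FROM raw_pets
ORDER BY rowid
"""
cleaned = conn.execute(staging_sql).fetchall()
for row in cleaned:
    print(row)
```

Deciding that `age` should be an integer, or that `ADOPTED` and `Adopted` are the same category, is the interactive part that the screenshot shows.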
Joins can be challenging when a standardized join key is missing (e.g., joining on non-standardized names).
We help you find the related entries and explain how they are related.
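For contrast, here is a minimal baseline that joins on non-standardized names using plain string similarity from Python's standard library (`difflib.SequenceMatcher`). It is not Cocoon's method, which relies on LLMs; the `normalize` and `fuzzy_join` helpers, the sample names, and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case, drop simple punctuation, and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())


def fuzzy_join(left, right, threshold=0.8):
    """Pair each left name with its most similar right name, if similar enough."""
    matches = []
    for l_name in left:
        best, best_score = None, 0.0
        for r_name in right:
            score = SequenceMatcher(None, normalize(l_name), normalize(r_name)).ratio()
            if score > best_score:
                best, best_score = r_name, score
        if best_score >= threshold:
            matches.append((l_name, best, best_score))
    return matches


left = ["Columbia University", "MIT"]
right = ["columbia univ.", "Massachusetts Institute of Technology"]
matches = fuzzy_join(left, right)
for l_name, r_name, score in matches:
    print(l_name, "<->", r_name)
```

The baseline pairs the two spellings of Columbia but leaves `MIT` unmatched, because an abbreviation shares almost no characters with the full name. Cases like that need semantic knowledge, and string similarity also cannot explain how two entries are related.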
Cite Cocoon Fuzzy Join
@article{huang2024disambiguate,
title={Disambiguate Entity Matching through Relation Discovery with Large Language Models},
author={Huang, Zezhou},
journal={arXiv preprint arXiv:2403.17344},
year={2024}
}
We are working on tools that help you understand data, break down silos, and maintain pipelines for the data warehouse.
These will make it much simpler to discover tables, generate reports, and make predictions.
Email zh2408@columbia.edu to learn more.