Cocoon-Data-Transformation/cocoon

License: MIT

😎 Cocoon provides LLM agents to organize raw data in your data warehouse, ready for analysis.

It is currently a lightweight Python library that connects to your database (e.g., Snowflake, DuckDB).

The agent interactively helps with various tasks for data cleaning, data integration, data modeling, and more.

Profile: Semantically understand your data and detect anomalies

Profiling is the first step to understanding the table and identifying any anomalies.

Many small decisions require semantic understanding by LLMs. For example, an age of 100 is acceptable, but -1 is impossible!
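The age example can be sketched as a plausible-range check. This is only an illustration, not Cocoon's implementation: in Cocoon the rule would come from an LLM's semantic reading of the column, whereas here the rule is hard-coded. The names `RangeRule` and `find_anomalies` and the 0 to 120 range are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class RangeRule:
    """Hypothetical rule: the plausible value range for one column."""
    column: str
    low: float
    high: float


def find_anomalies(rows, rule):
    """Return (row_index, value) pairs whose value falls outside the rule's range."""
    anomalies = []
    for i, row in enumerate(rows):
        value = row.get(rule.column)
        if value is None:
            # Missing values are a separate concern; skip them here.
            continue
        if not (rule.low <= value <= rule.high):
            anomalies.append((i, value))
    return anomalies


# Assumed range for human age; an LLM would propose this from the column's meaning.
AGE_RULE = RangeRule(column="age", low=0, high=120)

patients = [{"age": 34}, {"age": 100}, {"age": -1}, {"age": None}]
anomalies = find_anomalies(patients, AGE_RULE)
print(anomalies)
```

An age of 100 passes the check, while -1 is flagged. The point of semantic profiling is that the range itself does not need to be written by hand for every column.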

Check out more profiles
| Dataset Title | Profile Link |
| --- | --- |
| AQI and Latitude/Longitude of Countries | View Profile |
| 2020 Property Sales Data | View Profile |
| AAC Shelter Cat Outcome | View Profile |
| Books | View Profile |
| Cancer | View Profile |
| Divorces 2000-2015 | View Profile |
| German Credit Data | View Profile |
| K-Drama | View Profile |
| Patients | View Profile |
| Used Car Data | View Profile |
Cite Cocoon Profile
@article{huang2024cocoon,
  title={Cocoon: Semantic Table Profiling Using Large Language Models},
  author={Huang, Zezhou and Wu, Eugene},
  journal={arXiv preprint arXiv:2404.12552},
  year={2024}
}

(Preview) Stage: Automatically suggest cleaning steps and generate dbt code

Screenshot: LLMs help you interactively cast columns and fix cases. The output is a set of dbt staging SQL and YAML files.
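To give a rough idea of what a staging output could look like, the sketch below renders a dbt-style staging query that casts columns to target types. It is not Cocoon's generator: the function `render_staging_sql`, the source name `raw`, the table name `patients`, and the column types are all hypothetical. The only dbt feature it relies on is the `source()` Jinja function.

```python
def render_staging_sql(source_name, table_name, casts):
    """Render a dbt-style staging SELECT that casts each column to a target type.

    `casts` maps column name -> SQL type. All names here are illustrative.
    """
    select_lines = [
        f"    cast({column} as {sql_type}) as {column}"
        for column, sql_type in casts.items()
    ]
    columns = ",\n".join(select_lines)
    # Quadruple braces in an f-string produce the literal '{{' and '}}' that dbt expects.
    return (
        "select\n"
        f"{columns}\n"
        f"from {{{{ source('{source_name}', '{table_name}') }}}}\n"
    )


# Hypothetical cast decisions, such as an agent might confirm interactively.
staging_sql = render_staging_sql(
    "raw", "patients", {"age": "integer", "admitted_at": "date"}
)
print(staging_sql)
```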

(Preview) Fuzzy Join/Column Standardization/Entity Matching

Joins can be challenging when a standardized join key is missing (e.g., joining on non-standardized names).

Cocoon helps you find the related records and explains how they are related.
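For intuition about the problem, here is a minimal fuzzy-join sketch that uses plain string similarity from Python's standard library (`difflib.SequenceMatcher`). This is a simple baseline swapped in for illustration, not the LLM-based relation discovery that Cocoon uses. The helper names and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_join(left, right, threshold=0.8):
    """Pair each left name with its most similar right name, if similar enough."""
    if not right:
        return []
    matches = []
    for left_name in left:
        best = max(right, key=lambda right_name: similarity(left_name, right_name))
        score = similarity(left_name, best)
        if score >= threshold:
            matches.append((left_name, best, round(score, 2)))
    return matches


left_names = ["Columbia University", "MIT"]
right_names = ["columbia university.", "Stanford University"]
matches = fuzzy_join(left_names, right_names)
print(matches)
```

String similarity pairs the two spellings of "Columbia University" but cannot tell that "MIT" and "Massachusetts Institute of Technology" name the same entity, nor explain why two records are related. That gap is what semantic matching aims to close.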

Cite Cocoon Fuzzy Join
@article{huang2024disambiguate,
  title={Disambiguate Entity Matching through Relation Discovery with Large Language Models},
  author={Huang, Zezhou},
  journal={arXiv preprint arXiv:2403.17344},
  year={2024}
}

Future

We are working on tools that help you understand data, break down silos, and maintain pipelines in the data warehouse.

These will make discovering tables, generating reports, and making predictions incredibly simple.

Email zh2408@columbia.edu to learn more...