Skip to content

AlexTheAnalyst/SnowflakeCourse

Repository files navigation

PulseFit Sample Data

Realistic sample dataset for a fictional health and fitness app called PulseFit, built for Snowflake ingestion and modeling practice.

Scale

  • 500 users, 12 trainers
  • ~9 months of activity (Jul 2025 – Apr 2026)
  • Realistic signup curves (January bump), churn, upgrades/downgrades
  • Engagement tiers (power/regular/light) drive volume across all fact tables
  • Correlated behavior: goals influence workout types, meal calories, and weight trajectory
  • Intentional messiness: missing country/birth_date, null body_fat_pct and waist_inches, null sleep_hours

Files

File Rows (approx) Format Notes
users.csv 500 CSV dim_user
trainers.csv 12 CSV dim_trainer (includes 1 Inactive)
subscriptions.csv ~450 CSV dim_subscription with upgrade/downgrade/churn paths
goals.csv ~650 CSV user goals with realistic status mix
workouts.csv ~15k CSV fact_workout, trainer_id sometimes null
nutrition_logs.csv ~30k CSV fact_nutrition, meal-level
body_metrics.csv ~4k CSV fact_body_metric, sparse by design
daily_activity.csv ~60k CSV fact_activity, daily cadence
chat_messages.json ~2k NDJSON trainer/user messages, Premium/Elite users only
app_events.json ~40k NDJSON session-grouped with nested metadata

Use cases

  • CSV ingestion into Snowflake (COPY INTO from internal stage)
  • JSON ingestion into VARIANT columns + LATERAL FLATTEN
  • Medallion modeling: raw → staging → facts/dims
  • Cohort retention and churn analysis (signup_date + subscription end_date)
  • Engagement analytics (sessions per user, DAU/WAU/MAU from app_events)
  • Funnel analysis (app_open → workout_logged, app_open → meal_logged)
  • Goal attainment vs. activity volume
  • Trainer performance (message volume, assigned users, retention of coached users)

Suggested Snowflake flow

  1. Upload files to an internal stage (PUT file://... @my_stage).
  2. Create raw tables mirroring CSV schemas.
  3. Load JSON files into VARIANT columns.
  4. Flatten JSON with LATERAL FLATTEN into staging tables.
  5. Build dims (user, trainer, date) and facts (workout, nutrition, activity, event).
  6. Build reporting marts: retention, engagement, coaching impact, revenue.

Known quirks (intentional, for data cleaning practice)

  • ~2% of users have blank country
  • ~1% have blank birth_date
  • ~8% of body_metrics rows have null body_fat_pct
  • ~12% have null waist_inches
  • ~5% of daily_activity rows have null sleep_hours
  • Some emails have embedded numbers or underscores
  • Subscription history includes multi-row users (downgrades/upgrades) requiring window functions to resolve current plan
  • Trainer T3011 is Inactive but may appear on older workouts

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors