## 🌲 Full HexaDruid Feature Showcase

 We’ll walk through every top-level API with tips, parameter explanations, and best practices:

 1. **Schema inference & DRTree** (`schemaVisor` / `infer_schema`)
 2. **Skew column discovery** (`detect_skew`)
 3. **Key detection** (`detect_keys`)
 4. **Parameter recommendations** (`AutoParameterAdvisor`)
 5. **Fast heavy-hitter salting** (`HexaDruid.apply_smart_salting`)
 6. **One-liner optimization** (`simple_optimize`)
 7. **Before/after visualization** (`visualize_salting`)
 8. **Interactive tuning** (`interactive_optimize`)

 💡 _Tip for Noobs_: Run each cell in order. Adjust `sample_frac`, `threshold`, and `salt_count` to fit your data size and cluster.

In [None]:
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

from hexadruid import (
    HexaDruid,
    infer_schema,
    detect_skew,
    detect_keys,
    AutoParameterAdvisor,
    simple_optimize,
    visualize_salting,
    interactive_optimize)

spark = SparkSession.builder \
    .appName("HexaDruid Full Demo") \
    .getOrCreate()

### ⚙️ 2) Spark Configuration

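 - **`spark.sql.shuffle.partitions`** should match your expected number of salt buckets to avoid many tiny tasks.
 - **`spark.default.parallelism`** controls the default partition count for RDD operations (e.g. the heavy-hitter scan). Spark reads it when the SparkContext starts, so it has to be set before the session is created.

 **Best Practice:** set both to `min(salt_count, cluster_cores)`.

In [None]:
# Tune shuffle partitions for 8 salt buckets
spark.conf.set("spark.sql.shuffle.partitions", "8")
# spark.default.parallelism cannot be changed on a live session; pass it at startup,
# e.g. SparkSession.builder.config("spark.default.parallelism", "8") or spark-submit --conf.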

### 🛠️ 3) Create a Skewed DataFrame

 We simulate 100,000 rows:
 - 80% of rows have `user_id` = "A", the hot key; the remaining 20% split evenly between "U0" and "U5"
 - `amount` cycles through 0–99 (uniform)

 _Dev Tip_: Use `.limit()` for quick tests on subsets.

In [None]:
data = [("A" if i % 5 != 0 else f"U{i%10}", float(i % 100))
        for i in range(100_000)]

df = spark.createDataFrame(data, ["user_id", "amount"])

print("Total rows:", df.count())

df.groupBy("user_id").count().orderBy("count", ascending=False).show(3)
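
As a quick sanity check that needs no cluster, the same key pattern can be counted on the driver with plain Python. This sketch only re-runs the key expression from the cell above; `keys`, `key_counts`, and `hot_share` are names introduced here for illustration.

```python
from collections import Counter

# Same key expression as the DataFrame cell, evaluated without Spark.
keys = ["A" if i % 5 != 0 else f"U{i % 10}" for i in range(100_000)]

key_counts = Counter(keys)
hot_share = key_counts["A"] / len(keys)

print(key_counts.most_common(3))          # [('A', 80000), ('U0', 10000), ('U5', 10000)]
print(f"hot-key share: {hot_share:.0%}")  # hot-key share: 80%
```

The Spark `groupBy` output above should report the same three counts.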

### 📐 4) Schema Inference & DRTree

 **`infer_schema(df, sample_frac)`**  
 - **`sample_frac`**: fraction of rows to collect (e.g. 0.01 = 1%)  
 - Builds a safe `StructType` via driver-side regex/JSON sniffing  
 - Returns `(typed_df, schema, dr_tree)`  

 **Output**:  
 - `schema.simpleString()` shows each column’s inferred Spark type  
 - `dr_tree.to_dict()` shows a single trivial “all” branch when no skew is detected

In [None]:
typed_df, schema, dr_tree = infer_schema(df, sample_frac=0.01)

print("✅ Inferred schema:", schema.simpleString())

print("🌲 DRTree:", dr_tree.to_dict())
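
To make "driver-side regex/JSON sniffing" concrete, here is a minimal pure-Python sketch of the idea. It is not HexaDruid's implementation: `sniff_type`, the two regexes, and the `"json"` marker (which is not a real Spark type) are assumptions made for illustration.

```python
import json
import re

INT_RE = re.compile(r"[+-]?\d+")
FLOAT_RE = re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")

def _is_json_container(text):
    """True if text parses as a JSON object or array."""
    try:
        return isinstance(json.loads(text), (dict, list))
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False

def sniff_type(values):
    """Pick the narrowest type label that fits every non-null sample value."""
    present = [str(v).strip() for v in values if v is not None]
    if not present:
        return "string"  # nothing to learn from; fall back to the safest type
    if all(INT_RE.fullmatch(v) for v in present):
        return "bigint"
    if all(FLOAT_RE.fullmatch(v) for v in present):
        return "double"
    if all(_is_json_container(v) for v in present):
        return "json"  # hypothetical marker, not a Spark type
    return "string"

# A tiny hand-written sample standing in for the collected rows.
sample = {"user_id": ["A", "U0", "A"], "amount": ["1.0", "2.0", None]}
inferred = {name: sniff_type(vals) for name, vals in sample.items()}
print(inferred)  # {'user_id': 'string', 'amount': 'double'}
```

Falling back to `"string"` whenever any value fails to match is what keeps this kind of inference "safe": a wrong guess never makes a cast fail.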

### 🔍 5) Skew Column Discovery

 **`detect_skew(df, threshold, top_n)`**  
 - **`threshold`**: minimum IQR-based skew score a column needs to be included (use 0.0 to accept any score)  
 - **`top_n`**: return at most N columns, ordered by descending skew score  

 _Noobs_: set `threshold=0.0` to always get `top_n` candidates.

In [None]:
skew_cols = detect_skew(df, threshold=0.0, top_n=2)

print("🔍 Top skewed columns:", skew_cols)
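
The text does not spell out which IQR-based score `detect_skew` uses. As an illustration only, the sketch below uses Bowley's quartile skewness, one common IQR-based measure; `bowley_skew`, `top_skewed`, and the `latency` sample are hypothetical and may differ from what HexaDruid computes.

```python
import statistics

def bowley_skew(values):
    """Quartile (Bowley) skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1), in [-1, 1]."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # needs at least 2 values
    iqr = q3 - q1
    if iqr == 0:
        return 0.0  # the middle half is constant; treat as unskewed
    return (q3 + q1 - 2 * q2) / iqr

def top_skewed(columns, threshold, top_n):
    """Rank {name: values} by absolute skew and keep scores >= threshold."""
    scores = {name: abs(bowley_skew(vals)) for name, vals in columns.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, score) for name, score in ranked if score >= threshold][:top_n]

columns = {
    "amount": [float(i % 100) for i in range(1000)],  # uniform, like the demo column
    "latency": [1, 1, 1, 1, 2, 2, 3, 5, 20, 100],     # long right tail
}
ranked_cols = top_skewed(columns, threshold=0.0, top_n=2)
print(ranked_cols)
```

With `threshold=0.0` even the uniform `amount` column is returned (with a score of zero), which mirrors why a zero threshold always yields `top_n` candidates.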

### 🔑 6) Key Detection

 **`detect_keys(df, threshold, max_combo)`**  
 - **`threshold`**: uniqueness ratio `(distinct - nulls) / total` a candidate must reach to qualify  
 - **`max_combo`**: max columns to test for composite keys  

 _Pro Tip_: lower `threshold` (e.g. 0.01) when you know keys aren’t 100% unique.  
 If no candidate meets the threshold, it returns the single **best** one by default.

In [None]:
keys = detect_keys(df, threshold=0.01, max_combo=2)

print("🔑 Detected key candidate(s):", keys)
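
The uniqueness test can be sketched in plain Python. This version reads the ratio as "distinct non-null key tuples over total rows", which is one interpretation of the formula above; `uniqueness_ratio`, `key_candidates`, and the sample `rows` are illustrative names, not part of HexaDruid.

```python
from itertools import combinations

def uniqueness_ratio(rows, columns):
    """Distinct non-null key tuples divided by total row count."""
    if not rows:
        return 0.0
    keys = {
        tuple(row[c] for c in columns)
        for row in rows
        if all(row[c] is not None for c in columns)
    }
    return len(keys) / len(rows)

def key_candidates(rows, threshold, max_combo):
    """Score every column combination up to max_combo columns wide."""
    if not rows or max_combo < 1:
        return []
    columns = list(rows[0])
    scored = [
        (uniqueness_ratio(rows, combo), combo)
        for size in range(1, max_combo + 1)
        for combo in combinations(columns, size)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    passing = [combo for ratio, combo in scored if ratio >= threshold]
    # Fall back to the single best candidate when nothing clears the threshold.
    return passing if passing else [scored[0][1]]

rows = [
    {"id": 1, "grp": "a"},
    {"id": 2, "grp": "a"},
    {"id": 3, "grp": "b"},
    {"id": None, "grp": "b"},
]
print(key_candidates(rows, threshold=0.7, max_combo=2))   # [('id',), ('id', 'grp')]
print(key_candidates(rows, threshold=0.99, max_combo=2))  # [('id',)]
```

The second call shows the fallback: no combination reaches 0.99, so the best-scoring one is returned anyway.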

### 📊 7) Parameter Recommendations

 **`AutoParameterAdvisor(df, skew_top_n, cat_top_n)`**  
 - Samples up to `max_sample` rows so the metrics stay cheap to compute  
 - Recommends top skewed numeric & low-cardinality categorical columns  
 - `recommend()` returns `(skew_cands, cat_cands, metrics_df)`  

 **Metrics Table**:  
 - `skew`: IQR-based score  
 - `distinct` / `nulls`: counts computed on the sample  

 _Developer Tip_: adjust `sample_frac` or `max_sample` in the class if you need larger samples.

In [None]:
advisor = AutoParameterAdvisor(df, skew_top_n=2, cat_top_n=2)

skew_cands, cat_cands, metrics_df = advisor.recommend()

print("Numeric candidates:", skew_cands)

print("Categorical candidates:", cat_cands)

metrics_df.show(truncate=False)
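
For intuition, the sample-based `distinct` / `nulls` metrics and the low-cardinality ranking might look like the following in plain Python. This is a sketch under assumptions: `column_metrics`, `low_cardinality`, and the `seed` parameter are hypothetical, and the real advisor works on Spark DataFrames.

```python
import random

def column_metrics(rows, max_sample=1000, seed=0):
    """Per-column distinct and null counts on at most max_sample rows."""
    if len(rows) > max_sample:
        sample = random.Random(seed).sample(rows, max_sample)
    else:
        sample = rows
    metrics = {}
    for name in (sample[0] if sample else {}):
        values = [row[name] for row in sample]
        non_null = [v for v in values if v is not None]
        metrics[name] = {
            "distinct": len(set(non_null)),
            "nulls": len(values) - len(non_null),
        }
    return metrics

def low_cardinality(metrics, top_n):
    """Column names ordered by fewest distinct values first."""
    return sorted(metrics, key=lambda name: metrics[name]["distinct"])[:top_n]

demo_rows = [{"user_id": "A", "amount": float(i)} for i in range(10)]
demo_rows.append({"user_id": None, "amount": 1.0})

demo_metrics = column_metrics(demo_rows)
print(demo_metrics)
print(low_cardinality(demo_metrics, top_n=1))  # ['user_id']
```

Capping the sample size bounds the cost of the scan regardless of table size, at the price of approximate counts.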

### ⚡ 8) Fast Heavy-Hitter Salting (Auto)

 **`HexaDruid.apply_smart_salting(col_name=None, salt_count=None)`**  
 - **Auto-detects** the single most skewed column if `col_name=None`  
 - **Auto-sets** `salt_count` to `sparkContext.defaultParallelism` if `salt_count=None`  
 - **Heavy hitters** (keys with more than `total / salt_count` rows) get random buckets  
 - **Others** are hashed via `pmod(hash(key), salt_count)`  
 - Needs only a single full-table shuffle, so runtime stays close to linear in row count  

In [None]:
hd = HexaDruid(df)

t0 = time.time()

df_auto = hd.apply_smart_salting()  # no args = auto mode

print(f"\n⚡ Auto salting took {time.time() - t0:.2f}s")

df_auto.groupBy("salt").count().orderBy("salt").show()
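
The bucket-assignment rules listed above can be sketched without Spark. This is an illustration of the described logic, not HexaDruid's code: `assign_salts` is a hypothetical helper, and `zlib.crc32` stands in for Spark's `hash()` because Python's built-in string hash changes between processes.

```python
import random
import zlib
from collections import Counter

def assign_salts(keys, salt_count, seed=0):
    """Return one bucket id per key: random for heavy hitters, hashed otherwise."""
    rng = random.Random(seed)
    counts = Counter(keys)
    cutoff = len(keys) / salt_count  # more rows than one even bucket could hold
    heavy = {key for key, n in counts.items() if n > cutoff}
    salts = []
    for key in keys:
        if key in heavy:
            # Spread a hot key's rows across all buckets.
            salts.append(rng.randrange(salt_count))
        else:
            # Deterministic, non-negative bucket, analogous to pmod(hash(key), n).
            salts.append(zlib.crc32(str(key).encode("utf-8")) % salt_count)
    return salts

# Same key pattern as the demo DataFrame.
demo_keys = ["A" if i % 5 != 0 else f"U{i % 10}" for i in range(100_000)]
demo_salts = assign_salts(demo_keys, salt_count=8)

bucket_sizes = Counter(demo_salts)
for bucket in sorted(bucket_sizes):
    print(bucket, bucket_sizes[bucket])
```

Only "A" exceeds the cutoff of 12,500 rows here, so its 80,000 rows are scattered across the 8 buckets while "U0" and "U5" each stay together in a single bucket.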

### 🔄 9) One-Liner Optimize for Beginners

 **`simple_optimize(df, skew_col, sample_frac, salt_count)`**  
 wraps `infer_schema` + `apply_smart_salting` in one call.  
 Returns the salted & repartitioned DataFrame.  

 _Example_: rebalance on `"user_id"` with 5 buckets.

In [None]:
t1 = time.time()

df_simple = simple_optimize(
    df,
    skew_col="user_id",
    sample_frac=0.005,  # 0.5% sample for type sniffing
    salt_count=5
)
print(f"\n🔧 simple_optimize took {time.time() - t1:.2f}s")

df_simple.groupBy("salt").count().orderBy("salt").show()

### 📈 10) Before/After Visualization

 **`visualize_salting(df, skew_col, salt_count)`**  
 - Prints original skew distribution  
 - Applies heavy-hitter salting  
 - Prints new bucket counts  

 Great for quick sanity checks without manual `show()`.

In [None]:
_ = visualize_salting(
    df,
    skew_col="user_id",
    salt_count=8)
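
If you want a similar before/after view in your own scripts, counts can be rendered as text bars. The helper below is a hypothetical stand-in, not the output format of `visualize_salting`; `format_bars` and the sample counts are assumptions for illustration.

```python
def format_bars(counts, width=40):
    """Render {label: count} as text bars scaled to the largest count."""
    if not counts:
        return ""
    peak = max(counts.values())
    label_width = max(len(str(label)) for label in counts)
    lines = []
    for label, count in counts.items():
        bar_len = round(width * count / peak) if peak else 0
        lines.append(f"{str(label):<{label_width}} | {'#' * bar_len} {count}")
    return "\n".join(lines)

# Key counts of the demo data before salting.
before = {"A": 80_000, "U0": 10_000, "U5": 10_000}
print(format_bars(before, width=8))
```

Feeding the per-bucket counts from a `groupBy("salt").count()` result (collected to the driver) into the same helper gives the "after" picture.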

### 👩‍💻 11) Interactive Tuning

 **`interactive_optimize(df, sample_frac, skew_top_n, cat_top_n)`**  
 1. Shows recommended skew & categorical columns with metrics  
 2. Prompts you to **pick any** column  
 3. Applies heavy-hitter salting on your choice  

 _Tip_: **always** pick a low-cardinality column (i.e. one with few distinct values).

 Perfect for beginners who want guidance without writing code.

In [None]:
print("\n👩‍💻 Interactive optimize—pick any column:")

df_inter = interactive_optimize(df)

print("\n✅ Final salt distribution:")

df_inter.groupBy("salt").count().orderBy("salt").show()

## ✅ Summary & Best Practices

 - **Cache** intermediate DataFrames only when reused.  
 - **Align** `spark.sql.shuffle.partitions` with your salt bucket count.  
 - **Sample** wisely: use small fractions for large tables.  
 - **Lower thresholds** when exploring skew and keys.  
 - **Use heavy-hitter salting** instead of multi-pass quantile bucketing; it needs only one full-table shuffle.  

 HexaDruid turns dozens of lines into self-tuning, visual, Spark-native calls, perfect for noobs and experts alike! 🚀