## 🚀 Applying the New Dynamic Heavy-Hitter Salting

With the updated `apply_smart_salting` you can now salt **without any arguments**, and HexaDruid will:

1. **Auto-detect** the single most skewed column  
2. **Auto-choose** `salt_count` based on your cluster's parallelism  
3. **Identify heavy hitters** and spread them evenly  
4. **Hash the rest** into balanced buckets

You can still override the column, the salt count, or both if you want full control.
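
Steps 3 and 4 can be illustrated without Spark. The following is a minimal pure-Python sketch, not HexaDruid's actual implementation: the function names, the `heavy_fraction` threshold, the random scattering of heavy-hitter rows, and the MD5-based hash are all assumptions made for illustration.

```python
import hashlib
import random
from collections import Counter


def stable_hash(value: str) -> int:
    """Deterministic hash (Python's built-in hash() of str is randomized per process)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def salt_rows(keys, salt_count, heavy_fraction=0.05, seed=0):
    """Assign a salt bucket to every key.

    heavy_fraction is a hypothetical threshold: a key counts as a heavy
    hitter when it accounts for more than this share of all rows.
    """
    counts = Counter(keys)
    threshold = heavy_fraction * len(keys)
    heavy = {k for k, n in counts.items() if n > threshold}

    rng = random.Random(seed)
    salts = []
    for key in keys:
        if key in heavy:
            # Heavy hitter: scatter its rows uniformly across all buckets.
            salts.append(rng.randrange(salt_count))
        else:
            # Everything else: the same key always lands in the same bucket.
            salts.append(stable_hash(key) % salt_count)
    return salts, heavy


# Same key distribution as the demo data: 80% "A", the rest "U0" / "U5".
keys = ["A" if i % 5 != 0 else f"U{i % 10}" for i in range(100_000)]
salts, heavy = salt_rows(keys, salt_count=8)
bucket_sizes = Counter(salts)

print(sorted(heavy))  # ['A', 'U0', 'U5']
print(sorted(bucket_sizes.items()))
```

In this sketch, raising `heavy_fraction` to `0.5` leaves only `"A"` on the scatter path, so `"U0"` and `"U5"` are each hashed whole into a single bucket.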

---

### Usage Patterns

| Call                                  | Behavior                                                                                  |
|---------------------------------------|-------------------------------------------------------------------------------------------|
| `hd.apply_smart_salting()`            | Auto-detect col + auto salt_count + heavy-hitter salting                                  |
| `hd.apply_smart_salting("user_id")`   | Use `"user_id"` and auto salt_count + heavy-hitter logic                                  |
| `hd.apply_smart_salting(salt_count=8)`| Auto-detect col and force 8 buckets + heavy-hitter logic                                  |
| `hd.apply_smart_salting("amt", 5)`    | Force both `"amt"` and 5 buckets (heavy hitters + hash)                                   |
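
The four call patterns come down to filling in whichever argument the caller omitted. The sketch below shows one way that resolution could work; `resolve_salting_args`, `top_value_share`, the dict-of-lists stand-in for a DataFrame, and the `parallelism` default are hypothetical and are not part of HexaDruid's API.

```python
from collections import Counter


def top_value_share(values):
    """Fraction of rows taken by the most frequent value in a column."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)


def resolve_salting_args(columns, col=None, salt_count=None, parallelism=8):
    """Fill in whichever of (col, salt_count) the caller left out.

    columns: dict mapping column name -> list of values (stand-in for a DataFrame).
    parallelism: stand-in for the cluster's default parallelism.
    """
    if col is None:
        # Pick the column whose most frequent value dominates the most.
        col = max(columns, key=lambda name: top_value_share(columns[name]))
    if salt_count is None:
        salt_count = parallelism
    return col, salt_count


columns = {
    "user_id": ["A", "A", "A", "A", "U0"],
    "amount": [1.0, 2.0, 3.0, 4.0, 5.0],
}
print(resolve_salting_args(columns))                # ('user_id', 8)
print(resolve_salting_args(columns, salt_count=5))  # ('user_id', 5)
print(resolve_salting_args(columns, "amount", 5))   # ('amount', 5)
```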

### Imports & SparkSession

In [1]:
from pyspark.sql import SparkSession
from hexadruid import HexaDruid

#### 1 - Spark tuning: match shuffle partitions to the cluster's default parallelism

In [2]:
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", str(spark.sparkContext.defaultParallelism))


#### 2 - Build a skewed DataFrame

In [4]:
#    80% of rows get user_id="A"; the remaining 20% alternate between "U0" and "U5"
data = [("A" if i % 5 != 0 else f"U{i%10}", float(i % 100)) for i in range(100_000)]
df = spark.createDataFrame(data, schema=["user_id","amount"])
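
The key distribution can be checked without a Spark session. This quick sketch reuses the same expression for `user_id` and counts the values with `collections.Counter`; because `i % 10` can only be 0 or 5 when `i % 5 == 0`, the non-`"A"` rows fall on just two keys.

```python
from collections import Counter

# Same expression as the cell above, keeping only the user_id field.
user_ids = ["A" if i % 5 != 0 else f"U{i % 10}" for i in range(100_000)]
counts = Counter(user_ids)

for key, n in counts.most_common():
    print(f"{key}: {n} rows ({n / len(user_ids):.0%})")
# A: 80000 rows (80%)
# U0: 10000 rows (10%)
# U5: 10000 rows (10%)
```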

#### 3 - Initialize HexaDruid

In [5]:
hd = HexaDruid(df)

[INFO] Initialized HexaDruid (out=hexa_druid_outputs)


### 4a) Fully automatic heavy-hitter salting

In [None]:
df_auto = hd.apply_smart_salting()
print("\nAuto-detected & salted:")
df_auto.groupBy("salt").count().orderBy("salt").show()

[INFO] Auto-detected skew column: user_id
[INFO] Using salt_count=14
[INFO] Found heavy hitters: ['U5', 'U0', 'A']



🔍 Auto-detected & salted:
+----+-----+
|salt|count|
+----+-----+
|   0| 7065|
|   1| 7158|
|   2| 7234|
|   3| 7129|
|   4| 7249|
|   5| 7266|
|   6| 7079|
|   7| 7334|
|   8| 7076|
|   9| 7079|
|  10| 7083|
|  11| 7085|
|  12| 7048|
|  13| 7115|
+----+-----+



### 4b) Override only the salt count (keep auto column)

In [None]:
df_salt8 = hd.apply_smart_salting(salt_count=8)
print("\nForced 8 buckets (auto column):")
df_salt8.groupBy("salt").count().orderBy("salt").show()

[INFO] Auto-detected skew column: user_id
[INFO] Using salt_count=8
[INFO] Found heavy hitters: ['A']



🔧 Forced 8 buckets (auto column):
+----+-----+
|salt|count|
+----+-----+
|   0| 9937|
|   1|10066|
|   2|20132|
|   3|10007|
|   4|20086|
|   5| 9937|
|   6| 9943|
|   7| 9892|
+----+-----+
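
Buckets 2 and 4 in the 8-bucket output hold roughly twice as many rows as the others. One plausible reading, given that the log for this run lists only `'A'` as a heavy hitter, is that `U0` and `U5` (10,000 rows each) were each hashed whole into a single bucket. The sketch below quantifies the imbalance with a simple max-to-mean ratio; the `imbalance` helper is illustrative and not part of HexaDruid.

```python
def imbalance(bucket_counts):
    """Ratio of the largest bucket to the mean bucket size (1.0 = perfectly even)."""
    mean = sum(bucket_counts) / len(bucket_counts)
    return max(bucket_counts) / mean


# Bucket counts copied from the 8-bucket output above.
forced_8 = [9937, 10066, 20132, 10007, 20086, 9937, 9943, 9892]

print(f"rows: {sum(forced_8)}, imbalance: {imbalance(forced_8):.2f}")
# rows: 100000, imbalance: 1.61
```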



### 4c) Override only the column (keep auto salt_count)

In [None]:
df_on_amt = hd.apply_smart_salting("amount")
print("\nForced 'amount' column (auto buckets):")
df_on_amt.groupBy("salt").count().orderBy("salt").show()

[INFO] Using salt_count=14
[INFO] Found heavy hitters: []



🎯 Forced 'amount' column (auto buckets):
+----+-----+
|salt|count|
+----+-----+
|   0| 5000|
|   1| 7000|
|   2| 8000|
|   3| 6000|
|   4| 6000|
|   5| 3000|
|   6| 5000|
|   7| 5000|
|   8| 9000|
|   9|11000|
|  10|10000|
|  11| 8000|
|  12|10000|
|  13| 7000|
+----+-----+
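
Every count in this table is a multiple of 1,000. That is consistent with `amount` having 100 distinct values of 1,000 rows each and no heavy hitters: each value is hashed whole into one bucket, so bucket sizes depend on how many distinct values happen to collide. The sketch below reproduces the effect in plain Python; the MD5-based `stable_bucket` is an assumption and will not match the bucket assignments of whatever hash HexaDruid or Spark uses.

```python
import hashlib
from collections import Counter


def stable_bucket(value, salt_count):
    """Map a value to a bucket with a deterministic hash (MD5 here, as an assumption)."""
    digest = hashlib.md5(repr(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % salt_count


# Same amounts as the demo data: 100 distinct values, 1,000 rows each.
amounts = [float(i % 100) for i in range(100_000)]
bucket_sizes = Counter(stable_bucket(a, 14) for a in amounts)

# Sizes are uneven, but each one is a whole number of 1,000-row values.
for bucket in sorted(bucket_sizes):
    print(bucket, bucket_sizes[bucket])
```

With only 100 distinct keys spread over 14 buckets, some unevenness is expected from hashing alone; salting helps most when a few keys dominate, as `user_id` does here.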




### 4d) Full manual override

In [None]:
df_custom = hd.apply_smart_salting("user_id", salt_count=12)
print("\nCustom: 'user_id' + 12 buckets:")
df_custom.groupBy("salt").count().orderBy("salt").show()

[INFO] Using salt_count=12
[INFO] Found heavy hitters: ['U5', 'U0', 'A']



✍️ Custom: 'user_id' + 12 buckets:
+----+-----+
|salt|count|
+----+-----+
|   0| 8260|
|   1| 8472|
|   2| 8269|
|   3| 8400|
|   4| 8515|
|   5| 8264|
|   6| 8550|
|   7| 8190|
|   8| 8272|
|   9| 8298|
|  10| 8240|
|  11| 8270|
+----+-----+

