In Fabric, a Python (Spark) notebook talks to the Spark engine via Livy. Every time you submit a job or command to Spark (e.g. `df.count()`, `df.show()`, `spark.read.table(...)`, `df.write...`), you send a request through Livy.

The main ideas here:
- Plain Python executes eagerly, statement by statement (row by row inside a loop), while Spark first builds a plan from your instructions and only runs it when an action is called (lazy execution).
    - This means a tight Python `for` loop that triggers 2 Spark actions per row can easily generate 200 Livy calls on a small 100-row table.
- Many small Spark actions in Python loops = many Livy calls = easier to hit rate limits.
- Fewer, larger actions using Spark syntax (`withColumn`, joins, and unions) mean fewer Livy calls.
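
The arithmetic behind the 200-call figure can be sketched in plain Python. This is only a back-of-envelope model: it assumes each Spark action costs exactly one Livy request, and the helper names `requests_for_row_loop` and `requests_for_batched_plan` are hypothetical, not part of Fabric or Spark.

```python
# Hypothetical helpers (not a Fabric or Spark API): they only do the arithmetic,
# assuming each Spark action costs one Livy request.

def requests_for_row_loop(n_rows: int, actions_per_row: int) -> int:
    """Requests issued when every row triggers its own Spark actions."""
    return n_rows * actions_per_row


def requests_for_batched_plan(n_actions: int) -> int:
    """Requests issued when a few actions cover the whole table."""
    return n_actions


loop_total = requests_for_row_loop(n_rows=100, actions_per_row=2)
batch_total = requests_for_batched_plan(n_actions=2)
print(f'row loop: {loop_total} requests, batched: {batch_total} requests')
```

The loop's cost grows with the table size, while the batched plan's cost depends only on how many actions you call.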

## Sample data

In [1]:
from pyspark.sql import functions as F
data = [
    (1, 'A', 10.0),
    (2, 'B', 20.0),
    (3, 'C', 30.0),
    (4, 'D', 40.0),
]
df = spark.createDataFrame(data, ['id', 'category', 'value'])
df.show()

+---+--------+-----+
| id|category|value|
+---+--------+-----+
|  1|       A| 10.0|
|  2|       B| 20.0|
|  3|       C| 30.0|
|  4|       D| 40.0|
+---+--------+-----+



## Per‑row Spark work inside a Python loop (chatty = bad)

This is the pattern that most easily triggers Livy limits.

Characteristics:
- A `for` loop in Python.
- Inside the loop you call Spark actions (`count`, `show`, `collect`, `write`, `spark.read...`).
- Each iteration submits a new Spark job, and therefore a new Livy call.

In Fabric, doing this for hundreds or thousands of rows / tables can quickly hit gateway limits.

In [None]:
# DO NOT USE THIS PATTERN IN PRODUCTION
from time import sleep

ids = [r.id for r in df.select('id').collect()]  # one Spark job here

for i in ids:
    # Each iteration triggers its own Spark job (and a Livy request in Fabric)
    subset = df.filter(F.col('id') == i)   # lazy: nothing runs until an action is called
    cnt = subset.count()                   # Spark action = Livy call
    print(f'id={i}, count={cnt}')
    sleep(0.1)  # simulate some work

print('Finished chatty per-row loop')

## Single collect + pure Python loop (no extra Livy calls, but not scalable)

Here we **only call Spark once** (`collect()`), then do everything else in pure Python. This does **not** introduce extra Livy calls, but it has two downsides:
- All data must fit in the driver memory.
- You lose the benefits of distributed execution.

Use this only for small data or debugging.

In [2]:
rows = df.collect()  # single Spark action -> one Livy call

total = 0.0
for row in rows:
    # Pure Python logic: no Spark actions here, but no distributed execution either
    if row.category in ('A', 'B', 'C'):
        total += row.value * 1.1

print(f'Total (pure Python after collect): {total}')

Total (pure Python after collect): 66.0


## Lazy (vectorized) operations + a small number of actions (preferred)

Instead of looping row‑by‑row in Python, push the logic into Spark using built-in column expressions or user-defined functions (UDFs; in Fabric these are items just like notebooks, worth exploring when you get a chance). The key is:
- Build a transformation plan with `withColumn`, `filter`, `groupBy`, `agg`, etc. (lazy operations).
- Trigger the work with a small number of actions (e.g. one `write`, one `count`).

This pattern keeps Livy calls to a minimum and lets Spark optimize the execution plan.

In [3]:
# Example: compute an adjusted_value using an expression
df_transformed = df.withColumn(
    'adjusted_value',
    F.when(F.col('category').isin('A', 'B', 'C'), F.col('value') * 1.1)
     .otherwise(F.col('value')),
)

df_transformed.show() # One action: show or write – one Livy call

+---+--------+-----+--------------+
| id|category|value|adjusted_value|
+---+--------+-----+--------------+
|  1|       A| 10.0|          11.0|
|  2|       B| 20.0|          22.0|
|  3|       C| 30.0|          33.0|
|  4|       D| 40.0|          40.0|
+---+--------+-----+--------------+



## Many small actions vs a single action

Even without explicit `for` loops, **many small actions** can be chatty:

```python
df.filter(...).count()
df.filter(...).count()
df.filter(...).count()
```

Each `count()` is a separate job and a separate Livy request. Contrast that with a more compact approach that aggregates once.

In [None]:
# Counts by category (chatty pattern)
for cat in ['A', 'B', 'C', 'D']:
    cnt = df.filter(F.col('category') == cat).count()  # separate job per iteration
    print(cat, cnt)

In [4]:
# Counts by category (single aggregation)
df.groupBy('category').count().show()  # one aggregation -> one Spark job / Livy call

+--------+-----+
|category|count|
+--------+-----+
|       A|    1|
|       B|    1|
|       C|    1|
|       D|    1|
+--------+-----+

