### PySpark vs Pandas comparison

- Pandas is like a personal laptop—great for individual tasks. PySpark is like a supercomputing lab—built to handle tasks that would crash a single machine.

**1. High-level comparison**

| Feature | Pandas | PySpark |
|----|----|----|
| Architecture | Single node: runs on one machine. | Distributed: runs on a cluster. |
| Memory | Loads all data into RAM. | Distributes data across nodes. |
| Execution | Eager: each statement runs immediately. | Lazy: builds a plan (DAG) first and runs it only when an action is called. |
| Dataset Size | Small/Medium (up to ~10GB). | Big Data (TB to PB). |
| Performance | Faster for small data (no overhead). | Faster for massive data (parallelism). |

**2. Syntax Side-by-Side**

| Operation | Pandas Syntax | PySpark Syntax |
| ----- | ----- | ----- |
| Read CSV | `pd.read_csv("file.csv")` | `spark.read.csv("file.csv", header=True)` |
| Select Columns | `df[['name', 'age']]` | `df.select("name", "age")` |
| Filter | `df[df['age'] > 21]` | `df.filter(df.age > 21)` |
| Add Column | `df['new'] = df['id']*2` | `df.withColumn("new", df.id*2)` |
| GroupBy | `df.groupby("cat").sum()` | `df.groupBy("cat").sum()` |
| Renaming | `df.rename(columns={'a':'b'})` | `df.withColumnRenamed("a", "b")` |
| Missing Values | `df.fillna(0)` | `df.fillna(0)` or `df.na.fill(0)` |

**3. When should you switch?**

On Databricks you'll often use both. A rule of thumb:
- Stay with Pandas if: your data fits comfortably in a single machine's memory (usually < 2GB on a laptop, because intermediate copies can need several times the raw size) and you need quick exploratory analysis or plotting.
- Switch to PySpark if any of these apply:
  - Your dataset is larger than your available RAM (e.g., a 50GB file).
  - The processing takes too long on a single machine.
  - You are building a production pipeline that needs to scale as the business grows.

**4. The "Best of Both Worlds" (Pandas API on Spark)**

Since Apache Spark 3.2 (released in late 2021), Spark has shipped the Pandas API on Spark (formerly the separate Koalas project), and it is available on Databricks. It lets you write Pandas-style code that runs on the Spark engine.

```
import pyspark.pandas as ps

# This looks like Pandas, but runs distributed on a cluster!
df = ps.read_csv("s3://massive-bucket/data.csv")
df_filtered = df[df['price'] > 100]
```

### Joins (inner, left, right, outer)

**1. Join Types**

| Join Type | What it returns |
| ----- | ----- |
| Inner | Only the rows where there is a match in both tables. |
| Left (Outer) | All rows from the left table, plus matching rows from the right. (Non-matches get null). |
| Right (Outer) | All rows from the right table, plus matching rows from the left. (Non-matches get null). |
| Full Outer | All rows from both tables. It fills in null wherever a match is missing. |

**2. PySpark Syntax**

In PySpark, the join syntax is the same for every join type. The most important part is the `how` parameter.
```
# Basic Join Syntax
joined_df = df_left.join(df_right, df_left.customer_id == df_right.user_id, how="inner")

# Common 'how' options:
# "inner", "left", "right", "outer" (or "full")
```

**3. The Spark Secret: Broadcast Joins**

- In a distributed environment, joining two massive tables requires a shuffle, where Spark moves rows between executors so that matching keys end up on the same machine. This is slow.
- However, if one of your tables is small (e.g., a 10MB "Product Category" table) and the other is huge (e.g., a 1TB "Sales" table), Spark can perform a Broadcast Join instead.
- **How it works:** Spark sends a full copy of the small table to every executor.
- **The Benefit:** The large table doesn't have to be shuffled. The join happens locally on each executor.
- **Result:** Skipping the shuffle can cut the runtime dramatically, in favorable cases from minutes to seconds.
- Spark usually does this automatically when it estimates that a table is smaller than `spark.sql.autoBroadcastJoinThreshold` (10MB by default in open-source Spark), but you can force it in your code:

```
from pyspark.sql.functions import broadcast

# Tell Spark to send 'small_df' to all workers
joined_df = large_df.join(broadcast(small_df), "id")
```

### Window functions (running totals, rankings)

- Window functions let you perform calculations across a set of rows (a "window") that are related to the current row, without collapsing them into a single row as a groupBy does.
- If a groupBy is like a summary report, a Window Function is like adding a "Running Total" or "Rank" column to your existing spreadsheet.

**1. The Anatomy of a Window**
- In PySpark, every window function requires a Window Specification. 
- This defines how the data is grouped and ordered before the calculation happens.
```
from pyspark.sql.window import Window
from pyspark.sql import functions as F

# Define the "Spec"
windowSpec = Window.partitionBy("category").orderBy("sales_date")
```
- `partitionBy`: Defines the groups (e.g., "Calculate this for each product category").
- `orderBy`: Defines the sequence (e.g., "Sort by date so we can calculate a trend").
- `rowsBetween`: (Optional) Defines the "frame" or boundaries, counted in rows (e.g., "Only look at the current row and the 6 rows before it"). This equals "the last 7 days" only when there is exactly one row per day.

**2. Ranking Functions**
- Ranking is essential for "Top N" problems (e.g., "Find the top 3 best-selling products in every region").

| Function | Behavior | Result for tied values (10, 10, 12) |
| ----- | ----- | ----- |
| `row_number()` | Unique sequential number. | 1, 2, 3 |
| `rank()` | Leaves gaps after ties. | 1, 1, 3 |
| `dense_rank()` | No gaps after ties. | 1, 1, 2 |

Example: Ranking products by price within each category
```
df.withColumn(
    "rank",
    F.dense_rank().over(Window.partitionBy("category").orderBy(F.desc("price"))),
)
```

**3. Running Totals (Cumulative Sum)**
- To calculate a running total, you use a standard aggregation function (like `sum`) but apply it over a window.
- By default, if the window spec has an `orderBy`, Spark uses a frame that runs from the start of the partition up to the current row, which produces a running total. Rows that tie on the ordering column count as peers and receive the same total.
- Example: Cumulative sales per day
```
windowSpec = Window.partitionBy("store_id").orderBy("date")

df.withColumn("running_total", F.sum("daily_sales").over(windowSpec))
```

**4. Analytical Functions (Lead & Lag)**
- These functions let you "look back" at previous rows or "look ahead" at following rows within the window. This is useful for period-over-period comparisons such as **month-over-month growth.**
- **lag(col, 1):** Pulls the value from the previous row.
- **lead(col, 1):** Pulls the value from the next row.
- Example: Calculating Daily Growth
```
(df.withColumn("prev_day_sales", F.lag("sales").over(windowSpec))
   .withColumn("growth", F.col("sales") - F.col("prev_day_sales")))
```

**5. Summary Table**

| Use Case | Recommended Function |
| ---- | ---- |
| Top 3 items per category | `dense_rank()` |
| Cumulative Revenue | `sum().over(window)` |
| Year-over-Year (YoY) Growth | `lag()` |
| Moving Average (7-day) | `avg().over(window.rowsBetween(-6, 0))` |


### User-Defined Functions (UDFs)

Think of a UDF as a custom-made tool you build when the standard Spark toolbox doesn't have exactly what you need.

**1. How a UDF Works**
- When you define a UDF in Python, you are essentially telling Spark: "Take this Python function and apply it to every row in this column."

**The Python Syntax**
```
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# 1. Define a standard Python function
def categorize_price(price):
    # Spark passes nulls as None; comparing None with a number raises TypeError
    if price is None:
        return None
    return "Premium" if price > 100 else "Budget"

# 2. Register it as a UDF
price_udf = udf(categorize_price, StringType())

# 3. Use it in your DataFrame
df.withColumn("category", price_udf(df.price)).show()
```

**2. The "Python Tax" (Performance Warning)**
- Standard Python UDFs are notoriously slow. To understand why, we have to look at how Spark (which runs on the JVM) talks to Python.
- Serialization: Spark must "pickle" (serialize) the data into bytes.
- Data Movement: It sends those bytes from the JVM to a Python process on the worker node.
- Execution: Python runs the function row-by-row.
- Return: The result is serialized again and sent back to the JVM.
- The Golden Rule of Spark: Always check if a built-in function (like when().otherwise()) can do the job before reaching for a UDF. Built-in functions run directly in the JVM and are significantly faster.

**3. The Solution: Pandas UDFs (Vectorized)**
- To solve the performance bottleneck, Spark introduced Pandas UDFs. These use Apache Arrow to move data between the JVM and Python much more efficiently.
- Standard UDF: Processes data row-by-row.
- Pandas UDF: Processes data in batches (vectors).
- Pandas UDF Syntax
```
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def vectorized_categorize(price_series: pd.Series) -> pd.Series:
    # One vectorized comparison over the whole batch (no per-row Python lambda)
    labels = np.where(price_series > 100, "Premium", "Budget")
    return pd.Series(labels, index=price_series.index)

df.withColumn("category", vectorized_categorize(df.price))
```

**4. Comparison Table**

| Feature | Standard Python UDF | Pandas UDF (Vectorized) | Built-in Functions |
| ----- | ----- | ----- | ----- |
| Performance | Slow (row-by-row) | Fast (batched) | Fastest (native JVM) |
| Complexity | Easy to write. | Requires Pandas knowledge. | Easiest (if available). |
| Use Case | Complex, non-math logic. | Statistical/ML operations. | Most standard ETL. |

**5. When to use UDFs in Databricks**
- Use Built-in Functions: For the vast majority of tasks (math, string cleaning, date manipulation).
- Use Pandas UDFs: When you need to use libraries like scipy, numpy, or statsmodels on your data.
- Use Standard UDFs: Only when the logic is highly specialized, doesn't involve heavy math, and cannot be vectorized.

```
## load full ecommerce dataset

# Define the path to your downloaded CSV
file_path = "/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv"

# Read the file with correct options
df = (spark.read
      .format("csv")
      .option("header", "true")        # Uses the first row as column names
      .option("inferSchema", "true")   # Automatically detects data types (e.g., price as double)
      .load(file_path))
```

```
# Define the path to your downloaded CSV
file_path = "/Volumes/workspace/ecommerce/ecommerce_data/2019-Nov.csv"

# Read the file with correct options
df_n = (spark.read
      .format("csv")
      .option("header", "true")        # Uses the first row as column names
      .option("inferSchema", "true")   # Automatically detects data types (e.g., price as double)
      .load(file_path))
```

```
df_n.count()
```

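```
# Work on small samples so the join examples run quickly.
# Note: limit() does not guarantee which rows are returned.
df_oct = df.limit(1000)
df_nov = df_n.limit(1000)
```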

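```
## joins

## inner join
## Keeps only rows whose product_id appears in both samples.
## Each October row is paired with every matching November row, so rows can multiply.
inner_df = df_oct.join(df_nov, on="product_id", how="inner")
inner_df.count()
```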

```
## left join
## keeps every October row; November columns are null where there is no match
left_df = df_oct.join(df_nov, on="product_id", how="left")
left_df.count()
```

```
## full join
## Every product_id that appears in either sample

full_df = df_oct.join(df_nov, on="product_id", how="full")
full_df.count()
```

```
## semi join
## October rows whose product_id also appears in the November sample
## (only October columns are returned, and rows are not duplicated)

semi_df = df_oct.join(df_nov, on="product_id", how="semi")
semi_df.count()
```

```
## anti join
## October rows whose product_id does not appear in the November sample
anti_df = df_oct.join(df_nov, on="product_id", how="anti")
anti_df.count()
```

```
## aliasing

from pyspark.sql import functions as F

joined_df = df_oct.alias("oct").join(
    df_nov.alias("nov"),
    on="product_id",
    how="inner"
).select(
    "product_id",
    F.col("oct.price").alias("oct_price"),
    F.col("nov.price").alias("nov_price")
)

joined_df.count()
```

```
## cross join
## pairs every October row with every November row (row counts multiply)

cross_df = df_oct.crossJoin(df_nov)
cross_df.count()
```

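```
## cumulative revenue per day

from pyspark.sql import functions as F
from pyspark.sql.window import Window

## group by calendar date (not the raw timestamp) to get daily totals
## note: this sums price over every row, so filter first if only some rows count as revenue
daily_sales = (
    df.withColumn("event_date", F.to_date("event_time"))
      .groupBy("event_date")
      .agg(F.sum("price").alias("daily_revenue"))
)

## define window specification
## (no partitionBy: all rows go to one partition, which is fine for a few dozen daily rows)
window_spec = Window.orderBy("event_date")

## apply the running total
running_total_df = daily_sales.withColumn("running_total", F.sum("daily_revenue").over(window_spec))
running_total_df.orderBy("event_date").show()
```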

```
## time based features

df_features = (
    df.withColumn("event_time", F.to_timestamp("event_time"))
      .withColumn("hour", F.hour("event_time"))
      .withColumn("day_of_week", F.dayofweek("event_time"))   # 1 = Sunday, 7 = Saturday
      .withColumn("is_weekend", F.when(F.col("day_of_week").isin(1, 7), 1).otherwise(0))
)

df_features.limit(10).show()
```

```
# define a window over all rows in the same category (no ordering needed for an average)
cast_window = Window.partitionBy("category_id")

df_features = df_features.withColumn("avg_category_price", F.avg("price").over(cast_window)).withColumn(
    "price_diff_from_avg", F.col("price") - F.col("avg_category_price")
)

df_features.select("avg_category_price", "price_diff_from_avg").limit(10).show()
```