# Comparing Python (pandas) and PySpark Syntax

This notebook demonstrates key differences between standard Python (using pandas) and PySpark DataFrames for data analysis. Each section includes side-by-side examples for beginners.

## 1. Setup: Load Data

We'll use the same customer dataset for both pandas and PySpark.

In [0]:
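# Python (pandas): Load a table from Databricks over a SQL connection
# You will need to install the Databricks SQL connector (databricks-sql-connector)
# and provide your credentials.
# Uncomment and fill in your own workspace and credentials.

# import pandas as pd
# from databricks import sql
# connection = sql.connect(
#     server_hostname="<your-workspace-hostname>",  # placeholder
#     http_path="<your-http-path>",                 # placeholder
#     access_token="<your-access-token>",           # placeholder
# )
# query = 'SELECT * FROM my_database.my_schema.customers'
# customers_pd = pd.read_sql(query, connection)
# customers_pd.head()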

In [0]:
# PySpark setup
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
customers_spark = spark.read.table("my_database.my_schema.customers")
customers_spark.show()
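
If you do not have a Databricks connection handy, a small in-memory table can stand in for the customer data when trying the pandas side. The sketch below is only an illustration: the column names mirror those used in this notebook, but every value is made up, and `age` is stored as text on the assumption that this is why the Spark examples cast it to `int`.

```python
import pandas as pd

# Hypothetical sample rows; the values are invented for illustration only.
customers_pd = pd.DataFrame({
    "name": ["Alice", "Bob", "Carla", "Dev"],
    "age": ["34", "27", "45", "30"],  # text on purpose, so it needs a cast
    "gender": ["F", "M", "F", "M"],
    "location": ["New York", "Chicago", "Miami", "Boston"],
    "purchase_amount": [120.0, 80.0, 200.0, 50.0],
})

# head() returns the first rows (5 by default), much like show() in PySpark
print(customers_pd.head())
```

On a cluster, `spark.createDataFrame(customers_pd)` should turn the same frame into a Spark DataFrame for the PySpark cells.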

## 2. Data Types and Variable Declaration

**Python:**

In [0]:
a = 10  # integer
b = 3.14  # float
name = "Alice"  # string
print(a, b, name)

**PySpark:** Plain Python variables work exactly the same way. For the data itself, column types are defined by the DataFrame schema.

In [0]:
# Checking data types in PySpark DataFrame
customers_spark.printSchema()
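
The closest pandas counterpart to `printSchema()` is the `dtypes` attribute, and `astype` plays the role of `cast`. Here is a minimal sketch on a hypothetical two-row frame, assuming `age` arrives as text:

```python
import pandas as pd

# Hypothetical rows; only the column names come from this notebook.
customers_pd = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": ["34", "27"],
    "purchase_amount": [120.0, 80.0],
})

# One dtype per column, similar in spirit to printSchema()
print(customers_pd.dtypes)

# Convert the text column to integers (the pandas analogue of .cast('int'))
customers_pd["age"] = customers_pd["age"].astype(int)
print(customers_pd.dtypes)
```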

## 3. If-Else and Loops

**Python:**

In [0]:
if a > 5:
    print("a is greater than 5")
else:
    print("a is 5 or less")

for i in range(3):
    print(f"Loop iteration {i}")

**PySpark:** Avoid row-by-row loops. Express the logic as DataFrame column operations, which Spark can run in parallel across the cluster.

In [0]:
# Example: add a new column with if-else logic written as a column expression (no Python loop)
from pyspark.sql.functions import when
customers_spark = customers_spark.withColumn(
    "age_group",
    when(customers_spark.age.cast('int') > 30, "Senior").otherwise("Junior")
)
customers_spark.select("name", "age", "age_group").show()
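
For comparison, pandas can express the same column-wise if-else without a loop. One common way, sketched here on hypothetical rows, is `numpy.where`, which takes a condition, the value to use where it is true, and the value to use otherwise, much like `when(...).otherwise(...)`:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; age is text here, mirroring the cast in the PySpark cell.
customers_pd = pd.DataFrame({
    "name": ["Alice", "Bob", "Dev"],
    "age": ["34", "27", "30"],
})

age = customers_pd["age"].astype(int)

# Whole-column condition: "Senior" where age > 30, "Junior" everywhere else
customers_pd["age_group"] = np.where(age > 30, "Senior", "Junior")
print(customers_pd[["name", "age", "age_group"]])
```

Note that an age of exactly 30 falls into "Junior", because the condition is a strict `>`.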

## 4. Filtering Rows

**Python:**

In [0]:
# customers_pd[customers_pd["age"].astype(int) > 30]  # astype(int) mirrors the Spark cast; assumes no missing ages

**PySpark:**

In [0]:
customers_spark.filter(customers_spark.age.cast('int') > 30).show()
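
Filters often combine several conditions. In both libraries the conditions are joined with `&` (and) or `|` (or) rather than Python's `and`/`or`, and each condition needs its own parentheses. A small pandas sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows for illustration.
customers_pd = pd.DataFrame({
    "name": ["Alice", "Bob", "Carla", "Dev"],
    "age": ["34", "27", "45", "41"],
    "gender": ["F", "M", "F", "M"],
})

age = customers_pd["age"].astype(int)

# Each condition is wrapped in parentheses and combined with &
older_women = customers_pd[(age > 30) & (customers_pd["gender"] == "F")]
print(older_women)

# A PySpark version would follow the same pattern, for example:
# customers_spark.filter(
#     (F.col("age").cast("int") > 30) & (F.col("gender") == "F")
# )
```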

## 5. Aggregations (GroupBy)

**Python:**

In [0]:
#customers_pd.groupby("gender")["purchase_amount"].sum()

**PySpark:**

In [0]:
from pyspark.sql import functions as F
customers_spark.groupBy("gender").agg(F.sum("purchase_amount")).show()
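
When several aggregates are needed at once, pandas offers named aggregation, which lines up fairly closely with PySpark's `.agg(...)` plus `.alias(...)`. The sketch below uses hypothetical rows, and the output column names `total_spent` and `avg_spent` are chosen only for this example:

```python
import pandas as pd

# Hypothetical rows for illustration.
customers_pd = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "purchase_amount": [120.0, 80.0, 200.0, 50.0],
})

# Named aggregation: output_column=(input_column, aggregation)
summary = (
    customers_pd.groupby("gender")
    .agg(
        total_spent=("purchase_amount", "sum"),
        avg_spent=("purchase_amount", "mean"),
    )
    .reset_index()
)
print(summary)

# A PySpark version could look like:
# customers_spark.groupBy("gender").agg(
#     F.sum("purchase_amount").alias("total_spent"),
#     F.avg("purchase_amount").alias("avg_spent"),
# )
```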

## 6. Sorting

**Python:**

In [0]:
#customers_pd.sort_values(by="purchase_amount", ascending=False)

**PySpark:**

In [0]:
customers_spark.orderBy(F.col("purchase_amount").desc()).show()
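
Sorting by more than one column follows the same idea in both libraries. As a sketch on hypothetical rows, pandas takes a list of columns and a matching list of directions:

```python
import pandas as pd

# Hypothetical rows for illustration.
customers_pd = pd.DataFrame({
    "name": ["Alice", "Bob", "Carla", "Dev"],
    "gender": ["F", "M", "F", "M"],
    "purchase_amount": [120.0, 80.0, 200.0, 50.0],
})

# Sort by gender ascending, then by purchase_amount descending within each gender
ranked = customers_pd.sort_values(
    by=["gender", "purchase_amount"],
    ascending=[True, False],
)
print(ranked)

# A PySpark version could look like:
# customers_spark.orderBy(F.col("gender").asc(), F.col("purchase_amount").desc())
```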

## 7. Joins

**Python:**

In [0]:
# Create another DataFrame for joins
# cities_pd = pd.DataFrame({
#     "location": ["New York", "Los Angeles", "Chicago", "Houston", "Miami"],
#     "region": ["East", "West", "Midwest", "South", "South"]
# })

# customers_pd.merge(cities_pd, on="location", how="inner")

**PySpark:**

In [0]:
from pyspark.sql import Row
city_data = [
    Row(location="New York", region="East"),
    Row(location="Los Angeles", region="West"),
    Row(location="Chicago", region="Midwest"),
    Row(location="Houston", region="South"),
    Row(location="Miami", region="South")
]
cities_spark = spark.createDataFrame(city_data)
customers_spark.join(cities_spark, on="location", how="inner").show()
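
An inner join silently drops customers whose location has no match in the city table. To see which rows would be lost, one option in pandas is a left join with `indicator=True`. The rows below are hypothetical, and "Boston" is deliberately missing from the city table:

```python
import pandas as pd

# Hypothetical rows for illustration.
customers_pd = pd.DataFrame({
    "name": ["Alice", "Bob", "Dev"],
    "location": ["New York", "Chicago", "Boston"],
})
cities_pd = pd.DataFrame({
    "location": ["New York", "Chicago"],
    "region": ["East", "Midwest"],
})

# how="left" keeps every customer; indicator=True adds a "_merge" column
joined = customers_pd.merge(cities_pd, on="location", how="left", indicator=True)

# Rows marked "left_only" had no matching city
unmatched = joined[joined["_merge"] == "left_only"]
print(unmatched[["name", "location"]])

# In PySpark, a comparable check is an anti join:
# customers_spark.join(cities_spark, on="location", how="left_anti")
```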

## Summary: Key Differences

- **Python/pandas** works well for data that fits in memory on one machine and uses familiar, concise syntax.
- **PySpark** is designed for large datasets and distributed processing; you use DataFrame operations instead of loops for speed and scalability.
- Many operations look similar, but PySpark expresses column logic through explicit functions from `pyspark.sql.functions` (e.g., `F.sum("purchase_amount")` inside `.agg()` rather than calling `.sum()` on the grouped column).

Experiment with both to understand when to use each!