## ⚠️ For those coming from Pandas/ML

If you know **Pandas**, Spark will feel similar, but with key differences:
- **Distributed**: the data lives on multiple machines, not just one
- **Lazy**: Spark does not run transformations immediately; it builds a plan and executes it only when an action (such as `show()` or `count()`) needs a result
- **Immutable**: every transformation returns a new DataFrame

### 🔑 Quick equivalents:
```python
# Pandas → Spark
df.head()               → df.show()
df.info()               → df.printSchema()
len(df)                 → df.count()
df[df['age'] > 18]      → df.filter(col('age') > 18)
df[['col1', 'col2']]    → df.select('col1', 'col2')
df.groupby('x').sum()   → df.groupBy('x').sum()
# col comes from: from pyspark.sql.functions import col
```

### ⛔ NEVER do this:
```python
# ❌ DO NOT convert to Pandas to process large datasets
df.toPandas().groupby('col').sum()  # WRONG: pulls everything into driver memory!

# ✅ USE distributed Spark
df.groupBy('col').sum()  # CORRECT: stays distributed
```

📘 **Full guide**: See [PANDAS_TO_SPARK.ipynb](PANDAS_TO_SPARK.ipynb) for a detailed look at the differences.

---

# 🔧 Databricks Cheatsheet - In order of complexity

## 📊 1. Basic Data Reading

In [None]:
# Read a CSV file
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# Read JSON
df = spark.read.json("path/to/file.json")

# Read Parquet
df = spark.read.parquet("path/to/file.parquet")

# Read from a table
df = spark.read.table("database.table_name")

## 🔍 2. Basic Exploration

In [None]:
# Structure info
df.printSchema()    # Column schema
df.show(5)          # First 5 rows
df.count()          # Total number of rows
df.columns          # List of column names

# Basic statistics
df.describe().show()
df.select("column_name").distinct().show()

## 🎯 3. Selection and Filters

In [None]:
# Select columns
df.select("col1", "col2").show()

# Basic filters
df.filter(df.age > 18).show()
df.filter((df.age > 18) & (df.city == "Rome")).show()  # Parenthesize each condition
df.where(df.status == "active").show()                 # where() is an alias of filter()

# Limit
df.limit(100).show()

## 🔄 4. Basic Transformations

In [None]:
# Note: the wildcard import shadows Python built-ins such as sum, min, max and round
from pyspark.sql.functions import *

# New columns
df.withColumn("age_plus_10", col("age") + 10).show()
df.withColumn("constant", lit("value")).show()

# Rename
df.withColumnRenamed("old_name", "new_name").show()

# Sort
df.orderBy("age").show()
df.orderBy(desc("age")).show()

# Drop columns
df.drop("unwanted_col").show()

## 📊 5. Aggregations

In [None]:
# These Spark functions intentionally replace Python's built-in sum in this cell
from pyspark.sql.functions import count, sum, avg

# Basic group by
df.groupBy("category").count().show()
df.groupBy("category").sum("amount").show()
df.groupBy("category").avg("price").show()

# Multiple aggregations
df.groupBy("category").agg(
    count("*").alias("total_records"),
    sum("amount").alias("total_amount"),
    avg("price").alias("avg_price")
).show()

## 🔗 6. Basic Joins

In [None]:
# Join types
df1.join(df2, "common_column").show()           # Inner
df1.join(df2, "common_column", "left").show()   # Left
df1.join(df2, "common_column", "right").show()  # Right
df1.join(df2, "common_column", "outer").show()  # Full outer

# Join on differently named columns (both key columns are kept in the result)
df1.join(df2, df1.id == df2.user_id).show()

## 💾 7. Basic Saving

In [None]:
# Save as a table
df.write.saveAsTable("my_database.my_table")

# Save as files (Spark writes a directory of part files at this path)
df.write.parquet("path/to/output.parquet")
df.write.csv("path/to/output.csv", header=True)

# Write modes (the default mode fails if the target already exists)
df.write.mode("overwrite").saveAsTable("table")
df.write.mode("append").saveAsTable("table")

## 🗂️ 8. SQL Magic

In [None]:
# Register as a temporary view
df.createOrReplaceTempView("my_data")

# Use SQL (the result is a regular DataFrame)
result = spark.sql("SELECT * FROM my_data WHERE age > 25")

# Complex queries
spark.sql("""
    SELECT category, COUNT(*) as count, AVG(price) as avg_price
    FROM my_data
    GROUP BY category
    ORDER BY avg_price DESC
""").show()

## 🗄️ 9. Unity Catalog - Databases/Tables

In [None]:
# Create a database
spark.sql("CREATE DATABASE IF NOT EXISTS my_database")
spark.sql("USE my_database")

# Create a table from a DataFrame
df.write.saveAsTable("my_database.customers")

# Create an empty table
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_database.orders (
        order_id INT,
        customer_id INT,
        order_date DATE,
        amount DECIMAL(10,2)
    )
""")

# Table info
spark.sql("SHOW TABLES IN my_database").show()
spark.sql("DESCRIBE my_database.customers").show()

## 🔺 10. Delta Lake Operations

In [None]:
from delta.tables import DeltaTable

# Save as Delta
df.write.format("delta").saveAsTable("my_database.delta_table")

# Merge (UPSERT); new_data is a DataFrame of incoming rows, assumed to be defined
delta_table = DeltaTable.forName(spark, "my_database.customers")
delta_table.alias("target").merge(
    new_data.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

# Time Travel
spark.sql("SELECT * FROM my_database.customers VERSION AS OF 1").show()

# Optimizations
spark.sql("OPTIMIZE my_database.customers")                          # Compacts small files
spark.sql("OPTIMIZE my_database.customers ZORDER BY (customer_id)")  # Co-locates rows by customer_id
spark.sql("VACUUM my_database.customers")                            # Deletes unreferenced files past the retention period

## 🔗 11. Storage Mounting

In [None]:
# Mount Azure Data Lake (ADLS)
# The values below are placeholders; avoid hard-coding real credentials in a notebook
# (for example, read them with dbutils.secrets.get(scope=..., key=...))
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "your_client_id",
    "fs.azure.account.oauth2.client.secret": "your_client_secret",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/your_tenant_id/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://container@storageaccount.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs
)

# Mount S3
dbutils.fs.mount(
    source="s3a://your-bucket-name",
    mount_point="/mnt/s3bucket",
    extra_configs={
        "fs.s3a.access.key": "your_access_key",
        "fs.s3a.secret.key": "your_secret_key"
    }
)

# Manage mounts
dbutils.fs.mounts()                    # List mounts
dbutils.fs.unmount("/mnt/datalake")   # Unmount

## 🚀 12. Performance Tips (SPARK ONLY!)

In [None]:
# ❌ NEVER use pandas!
# ❌ WRONG: df.toPandas().groupby('col').sum()
# ✅ CORRECT: df.groupBy('col').sum()

# Cache for performance
df.cache()     # Keeps the DataFrame cached once an action has computed it
df.persist()   # Like cache(), but accepts an explicit storage level
df.unpersist() # Removes it from the cache

# Broadcast join (small tables, < 200MB, that fit in executor memory)
from pyspark.sql.functions import broadcast
large_df.join(broadcast(small_df), "key").show()

# Partitioning (both return a new DataFrame, so assign the result)
df = df.repartition(200)  # Redistributes with a full shuffle
df = df.coalesce(50)      # Reduces the number of partitions without a full shuffle

# Configuration
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

## 🔄 13. Advanced ETL Patterns

In [None]:
# Window Functions
from pyspark.sql.window import Window
from pyspark.sql.functions import desc, row_number, rank

window_spec = Window.partitionBy("customer_id").orderBy(desc("order_date"))
df.withColumn("row_num", row_number().over(window_spec)).show()

# Deduplication
df.dropDuplicates(["customer_id", "email"]).show()

# Slowly Changing Dimension (SCD Type 2)
# Complex pattern for tracking historical changes
def scd_type2_update(target_table, source_df, key_cols):
    # Close out old records, insert new ones
    # Placeholder: a production implementation is considerably more involved
    pass

## 📝 14. Standard Exercise Template

In [None]:
# STANDARD TEMPLATE FOR EVERY EXERCISE
from pyspark.sql.functions import col, count, avg

# 1. Read the data
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# 2. Always explore first!
df.printSchema()
df.show(5)
print(f"Total rows: {df.count()}")

# 3. Clean if necessary ("column_name" is a placeholder)
df_clean = df.filter(col("column_name").isNotNull())

# 4. Transform
df_transformed = df_clean.withColumn("new_col", col("old_col") * 2)

# 5. Analyze
result = df_transformed.groupBy("category").agg(
    count("*").alias("count"),
    avg("value").alias("avg_value")
)

# 6. Show the result
result.show()

# 7. Save if required
# result.write.saveAsTable("my_database.results")