# ⚡ Processing with Apache Spark - Big Data Finance

This notebook demonstrates how to use Apache Spark for distributed processing of financial data.

## Objectives
- Configure and initialize a Spark session
- Load financial data into Spark
- Apply distributed transformations
- Compute technical indicators
- Save the results (written to the local filesystem here, standing in for HDFS)
- Demonstrate performance optimizations

**Author:** Ana Luiza Pazze (Architecture and Infrastructure) & Big Data Finance Team  
**Management:** Fabio  
**Spark Processing:** Ana Luiza Pazze  
**Date:** 2024

In [1]:
# Required imports
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Add src to the path
sys.path.append('../src')

# Spark imports (explicit names, so builtins such as sum/min/max are not shadowed)
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    avg, broadcast, col, desc, lag, stddev, to_date, when
)
from pyspark.sql.window import Window
from pyspark.sql.types import (
    DoubleType, LongType, StringType, StructField, StructType
)

# Project module imports
from infrastructure.spark_manager import SparkManager
from infrastructure.hdfs_manager import HDFSManager

print("✅ Imports completed successfully!")

✅ Imports completed successfully!


## 1. ⚙️ Spark Environment Setup

We configure and initialize Spark through the project's `SparkManager` helper, running in local mode (`local[*]`) with 2 GB of memory each for the driver and the executor.
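
`SparkManager` is a project module whose source is not shown in this notebook. As a point of reference, a roughly equivalent session could be built with the standard `SparkSession.builder` API. The helper names `build_spark_conf` and `create_session` below are hypothetical, and the settings are an assumption about what the manager configures, not its actual implementation.

```python
def build_spark_conf(app_name, master="local[*]", executor_memory="2g", driver_memory="2g"):
    """Return the settings as a plain dict so they can be inspected before use."""
    return {
        "spark.app.name": app_name,
        "spark.master": master,
        "spark.executor.memory": executor_memory,
        "spark.driver.memory": driver_memory,
    }


def create_session(conf):
    """Build (or reuse) a SparkSession from a dict of settings."""
    # Imported lazily so the dict helper also works where PySpark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = build_spark_conf("BigDataFinance_Processing")
print(conf)
# spark = create_session(conf)  # run where PySpark and a supported JDK are available
```

Keeping the settings in a dict makes it easy to print or log the exact configuration before a session is created.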

In [2]:
# Initialize the Spark manager
spark_manager = SparkManager(app_name="BigDataFinance_Processing")

# Create the Spark session (local mode, 2 GB driver and executor memory)
spark = spark_manager.create_spark_session(
    master="local[*]",
    executor_memory="2g",
    driver_memory="2g"
)

print("✅ Spark Session created successfully!")
print(f"🔧 Spark Version: {spark.version}")
print(f"🔧 Application ID: {spark.sparkContext.applicationId}")
print(f"🔧 Master: {spark.sparkContext.master}")

# Spark settings
print("\n⚙️ Spark Settings:")
print("=" * 40)
configs = spark.sparkContext.getConf().getAll()
for key, value in sorted(configs):
    if 'spark.sql' in key or 'spark.executor' in key or 'spark.driver' in key:
        print(f"{key}: {value}")

Error creating Spark session: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.UnsupportedOperationException: getSubject is supported only if a security manager is allowed

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.UnsupportedOperationException: getSubject is supported only if a security manager is allowed
	at java.base/javax.security.auth.Subject.getSubject(Subject.java:347)
	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:588)
	at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2446)
	at scala.Option.getOrElse(Option.scala:201)
	at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2446)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:339)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:501)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:485)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:184)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:108)
	at java.base/java.lang.Thread.run(Thread.java:1575)
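
The session failed to start, which is why every later cell shows `In [None]` and no output. The message `getSubject is supported only if a security manager is allowed` typically appears when Spark's Hadoop client runs on a recent JDK (23 or newer), where `Subject.getSubject` is restricted. Running Spark on a JDK release that the installed Spark version documents as supported usually resolves it. Below is a minimal sketch of a pre-flight check, assuming the usual output format of `java -version`. The function name `parse_java_major` and the threshold `MAX_TESTED_MAJOR` are illustrative assumptions.

```python
import re


def parse_java_major(version_output):
    """Extract the major version from `java -version` output, or None if absent."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: "1.8.0_392" means Java 8
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


# Sample strings in the format printed by `java -version`
print(parse_java_major('openjdk version "17.0.9" 2023-10-17'))
print(parse_java_major('java version "1.8.0_392"'))

MAX_TESTED_MAJOR = 21  # assumption: adjust to what your Spark release documents
major = parse_java_major('openjdk version "23.0.1" 2024-10-15')
if major is not None and major > MAX_TESTED_MAJOR:
    print(f"JDK {major} is newer than {MAX_TESTED_MAJOR}; Spark may fail to start")
```

In a notebook, the input string can come from `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr`, since `java -version` writes to standard error.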


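## 2. 📊 Data Loading

We now load the previously collected financial data into Spark. If no collected file exists, the first cell generates a synthetic dataset so that the rest of the notebook can still run.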

In [None]:
# Check whether there is data to load
data_dir = '../data/raw'
# Guard against a missing directory: os.listdir would raise FileNotFoundError
stock_files = [
    f for f in (os.listdir(data_dir) if os.path.isdir(data_dir) else [])
    if f.startswith('stock_data_') and f.endswith('.csv')
]

if stock_files:
    # Use the most recent file (file names sort lexicographically)
    latest_file = sorted(stock_files)[-1]
    stock_file_path = os.path.join(data_dir, latest_file)
    print(f"📁 Loading data from: {latest_file}")
else:
    print("⚠️ No data file found. Run the notebook 01_data_collection_example.ipynb first")
    # Create sample data for demonstration
    print("🔧 Creating sample data...")
    
    # Generate synthetic data
    dates = pd.date_range(start='2023-01-01', end='2024-01-01', freq='D')
    symbols = ['AAPL', 'GOOGL', 'MSFT', 'TSLA', 'AMZN']
    
    data = []
    for symbol in symbols:
        base_price = np.random.uniform(100, 300)
        for date in dates:
            # Random walk: daily change drawn from a normal with 2% standard deviation
            price_change = np.random.normal(0, 0.02)
            base_price *= (1 + price_change)
            open_price = base_price * np.random.uniform(0.98, 1.02)
            
            data.append({
                'date': date.strftime('%Y-%m-%d'),
                'symbol': symbol,
                'open': open_price,
                # Keep each bar consistent: high >= max(open, close), low <= min(open, close)
                'high': max(open_price, base_price) * np.random.uniform(1.00, 1.05),
                'low': min(open_price, base_price) * np.random.uniform(0.95, 1.00),
                'close': base_price,
                'volume': np.random.randint(1000000, 10000000)
            })
    
    # Save the synthetic data
    os.makedirs(data_dir, exist_ok=True)
    stock_file_path = os.path.join(data_dir, 'stock_data_synthetic.csv')
    pd.DataFrame(data).to_csv(stock_file_path, index=False)
    print(f"✅ Synthetic data created: {stock_file_path}")

In [None]:
# Load the data into Spark
print("📊 Loading data into Spark...")

# Define an explicit schema so Spark skips schema inference
schema = StructType([
    StructField("date", StringType(), True),
    StructField("symbol", StringType(), True),
    StructField("open", DoubleType(), True),
    StructField("high", DoubleType(), True),
    StructField("low", DoubleType(), True),
    StructField("close", DoubleType(), True),
    StructField("volume", LongType(), True)
])

# Load the DataFrame
df = spark.read.csv(
    stock_file_path,
    header=True,
    schema=schema
)

# Convert the date column from string to a date type
df = df.withColumn("date", to_date(col("date"), "yyyy-MM-dd"))

# Cache for reuse; the count() below materializes the cache
df.cache()

print(f"✅ Data loaded: {df.count():,} records")
print("📋 Schema:")
df.printSchema()

# Show the first records
print("\n📊 First records:")
df.show(10)
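
Before computing indicators it can help to check that each loaded bar is internally consistent. The following is a plain-Python sketch of such a check on rows represented as dicts. The function `is_valid_bar` is a hypothetical helper, not part of the project code, and the sample rows are made up for illustration.

```python
def is_valid_bar(row):
    """True when low <= open, close <= high and volume is non-negative."""
    return (
        row["low"] <= min(row["open"], row["close"])
        and max(row["open"], row["close"]) <= row["high"]
        and row["volume"] >= 0
    )


rows = [
    {"symbol": "AAPL", "open": 100.0, "high": 103.0, "low": 99.0, "close": 101.5, "volume": 1_200_000},
    # Inconsistent on purpose: open is above high
    {"symbol": "AAPL", "open": 104.0, "high": 103.0, "low": 99.0, "close": 101.5, "volume": 1_200_000},
]
invalid = [r for r in rows if not is_valid_bar(r)]
print(f"{len(invalid)} invalid bar(s) out of {len(rows)}")
```

In Spark the same conditions could be expressed as a `filter` on the DataFrame, so that the check runs distributed rather than on the driver.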

## 3. 🔄 Basic Transformations

We register the DataFrame as a temporary view and compute per-symbol summary statistics with Spark SQL.

In [None]:
# Register the DataFrame as a temporary view
df.createOrReplaceTempView("stock_data")

# Basic statistics per symbol
print("📊 Statistics per Symbol:")
stats_df = spark.sql("""
    SELECT 
        symbol,
        COUNT(*) as records,
        MIN(date) as start_date,
        MAX(date) as end_date,
        ROUND(AVG(close), 2) as avg_price,
        ROUND(MIN(close), 2) as min_price,
        ROUND(MAX(close), 2) as max_price,
        ROUND(AVG(volume), 0) as avg_volume
    FROM stock_data 
    GROUP BY symbol
    ORDER BY symbol
""")

stats_df.show()

# Convert to pandas for display (one row per symbol, so the result is small)
stats_pandas = stats_df.toPandas()
print(f"\n📈 Summary: {len(stats_pandas)} symbols analyzed")

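## 4. 📈 Computing Technical Indicators

We compute technical indicators with Spark window functions, partitioned by symbol and ordered by date. A frame such as `rowsBetween(-6, 0)` covers the current row plus the six preceding rows, so the first rows of each symbol are averaged over fewer than seven values.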

In [None]:
# Define the window specifications
window_spec = Window.partitionBy("symbol").orderBy("date")
window_7d = Window.partitionBy("symbol").orderBy("date").rowsBetween(-6, 0)
window_20d = Window.partitionBy("symbol").orderBy("date").rowsBetween(-19, 0)
window_50d = Window.partitionBy("symbol").orderBy("date").rowsBetween(-49, 0)

print("📊 Computing technical indicators...")

# Compute the indicators
df_indicators = df.withColumn(
    "daily_return", 
    (col("close") / lag("close", 1).over(window_spec) - 1) * 100
).withColumn(
    "sma_7", 
    avg("close").over(window_7d)
).withColumn(
    "sma_20", 
    avg("close").over(window_20d)
).withColumn(
    "sma_50", 
    avg("close").over(window_50d)
).withColumn(
    "volatility_7d", 
    stddev("daily_return").over(window_7d)
).withColumn(
    "price_change", 
    col("close") - col("open")
).withColumn(
    "price_range", 
    col("high") - col("low")
).withColumn(
    "volume_sma_20", 
    avg("volume").over(window_20d)
)

# Cache the result
df_indicators.cache()

print("✅ Indicators computed!")
print("\n📊 Sample data with indicators:")
df_indicators.select(
    "date", "symbol", "close", "daily_return", 
    "sma_7", "sma_20", "volatility_7d"
).filter(
    col("symbol") == "AAPL"
).orderBy(
    desc("date")
).show(10)
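
To make the window semantics concrete, here is a plain-Python sketch of what a `rowsBetween(-(n - 1), 0)` average and a `lag(..., 1)` return compute for a single symbol. It illustrates the semantics only, not how Spark executes them, and the helper names are hypothetical.

```python
def trailing_mean(values, n):
    """Mean over the current value and up to n - 1 preceding ones."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]
        out.append(sum(window) / len(window))
    return out


def daily_returns(closes):
    """Percent change versus the previous close; None for the first row (no lag)."""
    out = [None] if closes else []
    for prev, cur in zip(closes, closes[1:]):
        out.append((cur / prev - 1) * 100)
    return out


closes = [100.0, 102.0, 101.0, 105.0]
print(trailing_mean(closes, 3))   # the first two entries use partial windows
print(daily_returns(closes))      # the first entry is None
```

The first `n - 1` moving-average values are based on fewer than `n` observations, so anything derived from them (for example a comparison against `sma_50`) is less reliable at the start of each series.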

### 4.1 Trading Signal Analysis

In [None]:
# Generate trading signals from the moving averages
print("📈 Generating trading signals...")

df_signals = df_indicators.withColumn(
    "signal",
    when(col("sma_7") > col("sma_20"), "BUY")
    .when(col("sma_7") < col("sma_20"), "SELL")
    .otherwise("HOLD")
).withColumn(
    "trend",
    when(col("close") > col("sma_50"), "UPTREND")
    .when(col("close") < col("sma_50"), "DOWNTREND")
    .otherwise("SIDEWAYS")
).withColumn(
    "volatility_level",
    when(col("volatility_7d") > 3.0, "HIGH")
    .when(col("volatility_7d") > 1.5, "MEDIUM")
    .otherwise("LOW")
)

# Register the view before querying it: spark.sql() resolves table names eagerly
df_signals.createOrReplaceTempView("stock_data_indicators")

# Signal analysis per symbol
signals_summary = spark.sql("""
    SELECT 
        symbol,
        signal,
        COUNT(*) as count,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY symbol), 2) as percentage
    FROM (
        SELECT symbol, 
               CASE 
                   WHEN sma_7 > sma_20 THEN 'BUY'
                   WHEN sma_7 < sma_20 THEN 'SELL'
                   ELSE 'HOLD'
               END as signal
        FROM stock_data_indicators
        WHERE sma_7 IS NOT NULL AND sma_20 IS NOT NULL
    )
    GROUP BY symbol, signal
    ORDER BY symbol, signal
""")

print("📊 Signal Distribution per Symbol:")
signals_summary.show()
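
The `signal` column labels the state on every row (short average above or below the long one), so a long run of `BUY` rows reflects one regime rather than many entry points. If the goal is to find the days on which the averages cross, one option is to compare each state with the previous one. Below is a minimal plain-Python sketch under that assumption: `classify` mirrors the `when` chain used for `signal`, `crossovers` is a hypothetical helper, and the sample values are made up.

```python
def classify(sma_short, sma_long):
    """Same rule as the signal column: BUY, SELL, or HOLD when equal."""
    if sma_short > sma_long:
        return "BUY"
    if sma_short < sma_long:
        return "SELL"
    return "HOLD"


def crossovers(sma_short, sma_long):
    """(index, new state) for each row whose state differs from the previous row."""
    states = [classify(s, l) for s, l in zip(sma_short, sma_long)]
    return [(i, states[i]) for i in range(1, len(states)) if states[i] != states[i - 1]]


sma_7 = [10.0, 10.5, 11.0, 10.2, 9.8]
sma_20 = [10.2, 10.4, 10.6, 10.5, 10.3]
print(crossovers(sma_7, sma_20))
```

In Spark, the previous row's state can be obtained with `lag("signal", 1).over(window_spec)` and compared with the current one in the same way.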

## 5. 🔍 Advanced Analysis with Spark SQL

We now run more complex analyses with Spark SQL: monthly returns per symbol, and summary statistics over those returns.

In [None]:
# Monthly performance analysis
print("📅 Monthly Performance Analysis:")

monthly_performance = spark.sql("""
    WITH monthly_data AS (
        SELECT 
            symbol,
            YEAR(date) as year,
            MONTH(date) as month,
            FIRST_VALUE(close) OVER (
                PARTITION BY symbol, YEAR(date), MONTH(date) 
                ORDER BY date
            ) as month_open,
            LAST_VALUE(close) OVER (
                PARTITION BY symbol, YEAR(date), MONTH(date) 
                ORDER BY date 
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
            ) as month_close,
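            -- Windowed average: a plain AVG() would require a GROUP BY here
            AVG(volume) OVER (
                PARTITION BY symbol, YEAR(date), MONTH(date)
            ) as avg_volume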
        FROM stock_data_indicators
    )
    SELECT DISTINCT
        symbol,
        year,
        month,
        ROUND((month_close / month_open - 1) * 100, 2) as monthly_return,
        ROUND(avg_volume, 0) as avg_volume
    FROM monthly_data
    ORDER BY symbol, year, month
""")

monthly_performance.show(20)

# Performance statistics
print("\n📊 Performance Statistics:")
performance_stats = spark.sql("""
    WITH monthly_returns AS (
        SELECT 
            symbol,
            ROUND((month_close / month_open - 1) * 100, 2) as monthly_return
        FROM (
            SELECT DISTINCT
                symbol,
                YEAR(date) as year,
                MONTH(date) as month,
                FIRST_VALUE(close) OVER (
                    PARTITION BY symbol, YEAR(date), MONTH(date) 
                    ORDER BY date
                ) as month_open,
                LAST_VALUE(close) OVER (
                    PARTITION BY symbol, YEAR(date), MONTH(date) 
                    ORDER BY date 
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                ) as month_close
            FROM stock_data_indicators
        )
    )
    SELECT 
        symbol,
        ROUND(AVG(monthly_return), 2) as avg_monthly_return,
        ROUND(STDDEV(monthly_return), 2) as volatility,
        ROUND(MIN(monthly_return), 2) as worst_month,
        ROUND(MAX(monthly_return), 2) as best_month,
        COUNT(*) as months_analyzed
    FROM monthly_returns
    GROUP BY symbol
    ORDER BY avg_monthly_return DESC
""")

performance_stats.show()
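
The monthly return used in both queries is the last close of the month divided by the first close, minus one. Below is a plain-Python sketch of the same computation for one symbol, assuming ISO `YYYY-MM-DD` date strings. `monthly_returns` is a hypothetical helper and the sample bars are made up.

```python
def monthly_returns(bars):
    """bars: list of (date_str, close). Returns {YYYY-MM: percent return}."""
    first_last = {}
    for date_str, close in sorted(bars):
        month = date_str[:7]
        if month not in first_last:
            first_last[month] = [close, close]   # [first close, last close]
        else:
            first_last[month][1] = close
    return {
        month: round((last / first - 1) * 100, 2)
        for month, (first, last) in first_last.items()
    }


bars = [
    ("2023-01-02", 100.0), ("2023-01-31", 110.0),
    ("2023-02-01", 110.0), ("2023-02-28", 99.0),
]
print(monthly_returns(bars))
```

In the SQL, `LAST_VALUE` needs the explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` frame because the default frame of an ordered window ends at the current row, which would make `month_close` equal to each row's own close.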

### 5.1 Correlation Analysis

In [None]:
# Compute correlations between assets
print("🔗 Correlation Analysis between Assets:")

# Pivot the daily returns: one row per date, one column per symbol
returns_pivot = spark.sql("""
    SELECT 
        date,
        MAX(CASE WHEN symbol = 'AAPL' THEN daily_return END) as AAPL,
        MAX(CASE WHEN symbol = 'GOOGL' THEN daily_return END) as GOOGL,
        MAX(CASE WHEN symbol = 'MSFT' THEN daily_return END) as MSFT,
        MAX(CASE WHEN symbol = 'TSLA' THEN daily_return END) as TSLA,
        MAX(CASE WHEN symbol = 'AMZN' THEN daily_return END) as AMZN
    FROM stock_data_indicators
    WHERE daily_return IS NOT NULL
    GROUP BY date
    ORDER BY date
""")

returns_pivot.cache()
print("📊 Returns matrix created")
returns_pivot.show(10)

# Convert to pandas to compute the correlations (Pearson by default)
returns_pandas = returns_pivot.toPandas().set_index('date')
correlation_matrix = returns_pandas.corr()

print("\n🔗 Correlation Matrix:")
print(correlation_matrix.round(3))
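
`DataFrame.corr()` in pandas computes the Pearson correlation by default and skips missing values pairwise. The plain-Python sketch below computes the same coefficient for two return series, only to make the formula explicit. `pearson` is a hypothetical helper and the sample returns are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation over positions where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# The second series is 0.8 times the first wherever both are present
aapl = [1.0, -0.5, 2.0, None, 0.3]
msft = [0.8, -0.4, 1.6, 0.2, 0.24]
print(round(pearson(aapl, msft), 3))
```

Note that the synthetic series generated earlier are independent random walks, so their correlations are expected to be close to zero.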

## 6. 💾 Saving to HDFS

We save the processed data for later use. In this local setup the files are written under `../data/processed` on the local filesystem, which stands in for HDFS.

In [None]:
# HDFS is simulated here: in this local environment the output goes to the local filesystem
print("💾 Preparing to save the processed data...")

# Create the output directory
output_dir = '../data/processed'
os.makedirs(output_dir, exist_ok=True)

# Save data with indicators
print("📊 Saving data with technical indicators...")
indicators_path = f"{output_dir}/stock_indicators"

# Save as Parquet (columnar format); coalesce(1) yields a single part file per output
df_signals.coalesce(1).write.mode("overwrite").parquet(indicators_path)
print(f"✅ Data saved to: {indicators_path}")

# Save performance statistics
print("📈 Saving performance statistics...")
performance_path = f"{output_dir}/performance_stats"
performance_stats.coalesce(1).write.mode("overwrite").parquet(performance_path)
print(f"✅ Statistics saved to: {performance_path}")

# Save monthly data
print("📅 Saving monthly performance data...")
monthly_path = f"{output_dir}/monthly_performance"
monthly_performance.coalesce(1).write.mode("overwrite").parquet(monthly_path)
print(f"✅ Monthly data saved to: {monthly_path}")

# Also save as CSV for compatibility
print("📄 Saving CSV versions...")
df_signals.coalesce(1).write.mode("overwrite").option("header", "true").csv(f"{output_dir}/stock_indicators_csv")
performance_stats.coalesce(1).write.mode("overwrite").option("header", "true").csv(f"{output_dir}/performance_stats_csv")

print("\n✅ All processed data was saved successfully!")
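
The cell above writes to a local directory. On a cluster, the same `write` calls accept an `hdfs://` URI instead. Below is a small sketch of a path helper that switches between the two. The function `output_uri`, the namenode address, and the HDFS base directory are all hypothetical placeholders, not values from this project.

```python
import posixpath


def output_uri(dataset, local_dir="../data/processed", namenode=None, hdfs_dir="/data/processed"):
    """Return a local path, or an hdfs:// URI when a namenode is given."""
    if namenode is None:
        return posixpath.join(local_dir, dataset)
    # Hypothetical layout: <hdfs_dir>/<dataset> under the given namenode
    return f"hdfs://{namenode}{posixpath.join(hdfs_dir, dataset)}"


print(output_uri("stock_indicators"))
print(output_uri("stock_indicators", namenode="namenode.example:8020"))
```

With such a helper, a call like `df_signals.write.mode("overwrite").parquet(output_uri("stock_indicators"))` stays the same in both environments, and only the `namenode` argument changes.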

## 7. 📊 Visualizing the Results

We plot the average monthly return and the monthly volatility per asset, followed by a heatmap of the return correlations.

In [None]:
# Convert the data for plotting
performance_pandas = performance_stats.toPandas()
monthly_pandas = monthly_performance.toPandas()

# Average monthly return chart
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
performance_pandas.set_index('symbol')['avg_monthly_return'].plot(kind='bar', color='skyblue')
plt.title('📈 Average Monthly Return per Asset', fontweight='bold')
plt.ylabel('Return (%)')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)

# Volatility chart
plt.subplot(1, 2, 2)
performance_pandas.set_index('symbol')['volatility'].plot(kind='bar', color='coral')
plt.title('📊 Monthly Volatility per Asset', fontweight='bold')
plt.ylabel('Volatility (%)')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, 
            annot=True, 
            cmap='RdBu_r', 
            center=0,
            square=True, 
            linewidths=0.5,
            fmt='.3f')
plt.title('🔗 Correlation Matrix of Returns (Spark Processing)', fontweight='bold')
plt.tight_layout()
plt.show()

## 8. ⚡ Performance Optimizations

We look at three Spark optimizations: caching, repartitioning by key, and broadcast joins.

In [None]:
# Spark performance analysis
print("⚡ Spark Performance Analysis")
print("=" * 40)

# Cache information: list the temporary views and whether each one is cached
print("💾 Temporary views (cached?):")
cached_tables = spark.catalog.listTables()
for table in cached_tables:
    if table.isTemporary:
        print(f"  - {table.name}: cached={spark.catalog.isCached(table.name)}")

# Spark context statistics
sc = spark.sparkContext
print(f"\n📊 Spark Context Statistics:")
print(f"  - Application ID: {sc.applicationId}")
print(f"  - Default Parallelism: {sc.defaultParallelism}")
# PySpark's StatusTracker exposes job and stage status, not executor details
print(f"  - Active jobs: {sc.statusTracker().getActiveJobsIds()}")

# Partitioning example
print("\n🔧 Partitioning Optimization:")
print(f"Current DataFrame partitions: {df_signals.rdd.getNumPartitions()}")

# Repartition by symbol so that rows of the same symbol land in the same partition
df_optimized = df_signals.repartition(col("symbol"))
print(f"Partitions after repartitioning: {df_optimized.rdd.getNumPartitions()}")

# Broadcast join example (simulated)
print("\n📡 Broadcast Join Example:")
# Create a small metadata table
metadata = spark.createDataFrame([
    ("AAPL", "Technology", "Apple Inc."),
    ("GOOGL", "Technology", "Alphabet Inc."),
    ("MSFT", "Technology", "Microsoft Corp."),
    ("TSLA", "Automotive", "Tesla Inc."),
    ("AMZN", "E-commerce", "Amazon.com Inc.")
], ["symbol", "sector", "company_name"])

# Broadcast the small table
from pyspark.sql.functions import broadcast
df_with_metadata = df_signals.join(
    broadcast(metadata), 
    "symbol", 
    "left"
)

print("✅ Broadcast join configured to optimize performance")
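
`repartition(col("symbol"))` hash-partitions rows by key into `spark.sql.shuffle.partitions` partitions (200 by default, unless adaptive query execution coalesces them), so with five symbols at most five partitions hold data. The sketch below illustrates the idea in plain Python. It uses `zlib.crc32` as a stand-in hash, which is an assumption for illustration and not the hash function Spark uses.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Assign a key to a partition by hashing (crc32 stands in for Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


symbols = ["AAPL", "GOOGL", "MSFT", "TSLA", "AMZN"]
num_partitions = 200  # default value of spark.sql.shuffle.partitions

assignment = {s: partition_for(s, num_partitions) for s in symbols}
non_empty = Counter(assignment.values())
print(assignment)
print(f"{len(non_empty)} of {num_partitions} partitions would hold data")
```

Passing an explicit count, as in `df_signals.repartition(5, col("symbol"))`, avoids creating many empty partitions for a small key set, although two keys can still hash to the same partition.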

## 9. 🔍 Monitoring and Debugging

We look at how to inspect executors, find the Spark UI, and read an execution plan.

In [None]:
# Information about executed jobs
print("🔍 Spark Job Information")
print("=" * 35)

# PySpark's StatusTracker has no getExecutorInfos(), so executor details are read
# from the JVM SparkStatusTracker through the internal _jsc handle (not a public API)
jvm_status_tracker = sc._jsc.sc().statusTracker()

# Executor information
executor_infos = jvm_status_tracker.getExecutorInfos()
print(f"📊 Number of Executors: {len(executor_infos)}")

for executor in executor_infos:
    print(f"\n🔧 Executor {executor.host()}:{executor.port()}")
    print(f"  - Cached bytes: {executor.cacheSize()}")
    print(f"  - Running tasks: {executor.numRunningTasks()}")

# Spark UI URL
spark_ui_url = spark.sparkContext.uiWebUrl
if spark_ui_url:
    print(f"\n🌐 Spark UI available at: {spark_ui_url}")
else:
    print("\n⚠️ Spark UI not available (spark.ui.enabled is probably false)")

# Explain plan example
print("\n📋 Execution Plan (Explain):")
print("=" * 40)
df_signals.filter(col("symbol") == "AAPL").select("date", "close", "sma_20").explain(True)
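
Alongside the Spark UI, a simple way to compare alternatives (for example cached versus uncached) is to time the actions from the driver. Below is a minimal sketch using only the standard library. `timed` is a hypothetical helper, and the summation is a stand-in workload. In the notebook, the body of the `with` block would be a Spark action such as `df_signals.count()`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Measure the wall-clock time of the enclosed block and print it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f}s")


timings = {}
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))
```

Spark transformations are lazy, so only actions (`count`, `show`, `write`, `toPandas`) trigger computation. Timing a `withColumn` or `filter` call alone measures plan construction, not execution.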

## 10. 🧹 Cleanup and Shutdown

We release cached data and stop the Spark session.

In [None]:
# Clear the cache
print("🧹 Clearing cache...")
spark.catalog.clearCache()

# Unpersist DataFrames (no-ops after clearCache, kept to make the intent explicit)
df.unpersist()
df_indicators.unpersist()
returns_pivot.unpersist()

print("✅ Cache cleared")

# Final statistics
print("\n📊 SPARK PROCESSING SUMMARY")
print("=" * 45)
print(f"📈 Records processed: {df.count():,}")
print(f"🔧 Indicators computed: 8 (daily return, SMAs, volatility, etc.)")
print(f"💾 Outputs saved: 5 (3 Parquet + 2 CSV)")
print(f"⚡ Executors used: {len(executor_infos)}")
print(f"🎯 Performance: Optimized with caching and partitioning")

print("\n🎯 NEXT STEPS:")
print("=" * 25)
print("1. 📊 Advanced statistical analysis")
print("2. 🤖 Apply ML models")
print("3. 💭 Sentiment analysis")
print("4. 📈 Interactive dashboards")
print("5. 🔄 Automated pipeline")

print("\n✨ Spark processing completed successfully!")

In [None]:
# Stop the Spark session
print("🔚 Stopping Spark Session...")
spark.stop()
print("✅ Spark Session stopped")

print("\n🎉 Notebook completed successfully!")
print("📁 Processed data available in: ../data/processed/")
print("🔄 Run the next notebook for the ML analyses")

---

## 📚 References and Useful Links

- **Apache Spark**: [spark.apache.org](https://spark.apache.org/)
- **PySpark Documentation**: [spark.apache.org/docs/latest/api/python/](https://spark.apache.org/docs/latest/api/python/)
- **Spark SQL Guide**: [spark.apache.org/docs/latest/sql-programming-guide.html](https://spark.apache.org/docs/latest/sql-programming-guide.html)
- **Performance Tuning**: [spark.apache.org/docs/latest/tuning.html](https://spark.apache.org/docs/latest/tuning.html)

---

**Developed by the Big Data Finance Team**  
**Notebook:** 02_spark_processing_example.ipynb  
**Version:** 1.0