<a href="https://colab.research.google.com/github/LucasJFaust/spark_projects/blob/main/sparkui.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SparkUI
SparkUI is a tool provided by Spark that shows, in real time, how our data is being processed. With it we can inspect:

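- Jobs: the end result we want from our code. Each Spark action (for example `show()` or a write) starts one or more jobs.
- Stages: the steps a job is split into whenever data has to be shuffled. Each stage is a set of tasks that run in parallel.
- Tasks: the units of processing that can run in parallel, one per data partition.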

Let's install ngrok, a tool that exposes a local port through a public URL and so lets us open the SparkUI from Google Colab.

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 3.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=834e53c983e11d3ce1d36903dc91d5c9fc22a5e2f6cf2fd7e28970971ca9f065
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
!wget -qnc https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip -n -q ngrok-stable-linux-amd64.zip

In [3]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
      .config('spark.ui.port', '4050')
      .appName("SparkUI Introdução")
      .getOrCreate()
)

In [4]:
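# Replace YOUR_NGROK_AUTHTOKEN with your own token; never publish a real token in a shared notebook.
!./ngrok authtoken YOUR_NGROK_AUTHTOKEN
# Start the tunnel to the SparkUI port in the background.
get_ipython().system_raw('./ngrok http 4050 &')
# Give ngrok a few seconds to open the tunnel.
!sleep 10
# Ask ngrok's local API (port 4040) for the public HTTPS URL.
!curl -s http://localhost:4040/api/tunnels | grep -Po 'public_url":"(?=https)\K[^"]*'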

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml


In [None]:
import requests

response = requests.get('http://localhost:4040/api/tunnels')
data = response.json()
public_url = data['tunnels'][0]['public_url']
print(public_url)
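
Indexing `data['tunnels'][0]` fails with an `IndexError` when no tunnel is open yet, and the first entry is not guaranteed to be the HTTPS one. A slightly more defensive variant, mirroring the `https` filter of the `curl` cell, could look like the sketch below. The helper name `pick_public_url` and the sample payload are hypothetical; only the `tunnels` and `public_url` keys come from the cell above.

```python
def pick_public_url(payload, scheme="https"):
    """Return the first tunnel URL that uses the given scheme, or None if there is none."""
    for tunnel in payload.get("tunnels", []):
        url = tunnel.get("public_url", "")
        if url.startswith(scheme + "://"):
            return url
    return None


# Made-up payload containing only the fields this helper reads.
sample = {
    "tunnels": [
        {"public_url": "http://example.ngrok.io"},
        {"public_url": "https://example.ngrok.io"},
    ]
}

print(pick_public_url(sample))
print(pick_public_url({"tunnels": []}))
```

In the notebook it would be called as `pick_public_url(response.json())`.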

In [None]:
# {
#     "id_transacao": 1000,
#     "valor": "58931.97",
#     "remetente": {"nome": "Jonathan Gonsalves", "banco": "BTG", "tipo": "PF"},
#     "destinatario": {"nome": "Emanuella Moura", "banco": "Itau", "tipo": "PJ"},
#     "transaction_date": "2021-06-02",
#     "chave_pix": "aleatoria",
#     "fraude": "1"
# }

In [6]:
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType, TimestampType

schema_remetente_destinatario = StructType([
    StructField('nome', StringType()),
    StructField('banco', StringType()),
    StructField('tipo', StringType()),
])


schema_base_pix = StructType([
    StructField('id_transacao', IntegerType()),
    StructField('valor', DoubleType()),
    StructField('remetente', schema_remetente_destinatario),
    StructField('destinatario', schema_remetente_destinatario),
    StructField('transaction_date', TimestampType()),
    StructField('chave_pix', StringType()),
    StructField('fraude', IntegerType())
])


caminho_json = '/content/pix_transactions.json'

df = spark.read.json(
    caminho_json,
    schema=schema_base_pix,
    timestampFormat="yyyy-MM-dd"
)

In [7]:
df.show()

+------------+--------+--------------------+--------------------+-------------------+---------+------+
|id_transacao|   valor|           remetente|        destinatario|   transaction_date|chave_pix|fraude|
+------------+--------+--------------------+--------------------+-------------------+---------+------+
|        1000|    7.05|{Jonathan Gonsalv...|{Gabriel Cunha, I...|2022-03-19 00:00:00|      cpf|     0|
|        1001|   37.28|{Jonathan Gonsalv...|{Diego Souza, XP,...|2021-01-26 00:00:00|aleatoria|     0|
|        1002|  282.73|{Jonathan Gonsalv...|{Nicole Nunes, BT...|2022-05-31 00:00:00|aleatoria|     0|
|        1003| 8447.92|{Jonathan Gonsalv...|{Maria Fernanda C...|2022-07-04 00:00:00|aleatoria|     0|
|        1004|   58.51|{Jonathan Gonsalv...|{Isabel Silva, C6...|2021-09-11 00:00:00|aleatoria|     0|
|        1005| 6655.12|{Jonathan Gonsalv...|{Anthony Carvalho...|2022-02-11 00:00:00|  celular|     0|
|        1006| 9912.25|{Jonathan Gonsalv...|{Eloah Monteiro, ...|2022-05-

In [10]:
df.select('destinatario.nome').show()

+--------------------+
|                nome|
+--------------------+
|       Gabriel Cunha|
|         Diego Souza|
|        Nicole Nunes|
|Maria Fernanda Ca...|
|        Isabel Silva|
|    Anthony Carvalho|
|      Eloah Monteiro|
|        Sophie Rocha|
|      Pietro Ribeiro|
|      Eloah Teixeira|
|     Emanuella Sales|
|    Valentina Campos|
|       Stella Araujo|
|     Benicio Costela|
|      Joao Fernandes|
|   Gabriela da Rocha|
|      Larissa Aragao|
|           Theo Dias|
|        Danilo Jesus|
|       Bruno Correia|
+--------------------+
only showing top 20 rows



In [12]:
df.write.mode('overwrite').partitionBy('chave_pix').parquet('outpute/pix')