<!-- Projeto Desenvolvido na Data Science Academy - www.datascienceacademy.com.br -->
# <font color='blue'>Data Science Academy</font>
## <font color='blue'>PySpark and Apache Kafka for Batch and Streaming Data Processing</font>
## <font color='blue'>Project 3</font>
### <font color='blue'>Cleaning and Transformation Pipeline for AI Applications with PySpark SQL</font>

## Python Packages Used in the Project

In [1]:
# Imports
import os
import pyspark
import pandas as pd
import numpy as np
import pyspark.sql.functions as F
import pyspark.ml.evaluation as evals
import pyspark.ml.tuning as tune
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml.feature import  VectorAssembler
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
# Note: this import shadows Python's built-in round() for the rest of the notebook
from pyspark.sql.functions import round, desc

In [2]:
# Versions of the packages used in this Jupyter notebook
%reload_ext watermark
%watermark -a "Data Science Academy"

Author: Data Science Academy



## Creating the Spark Session and Setting the Log Level

In [4]:
# Creates the session
spark = SparkSession.builder.appName('Projeto3-Exp').getOrCreate()

In [5]:
# Sets the log level
spark.sparkContext.setLogLevel("ERROR")

## Loading the Datasets from HDFS

In [6]:
# Loads file 1
df_dsa_aeroportos = spark.read.csv("/opt/spark/data/dataset1.csv", header = True)

                                                                                

In [7]:
type(df_dsa_aeroportos)

pyspark.sql.dataframe.DataFrame

In [8]:
df_dsa_aeroportos.show(10)

+---+--------------------+----------+------------+----+---+---+
|faa|                name|       lat|         lon| alt| tz|dst|
+---+--------------------+----------+------------+----+---+---+
|04G|   Lansdowne Airport|41.1304722| -80.6195833|1044| -5|  A|
|06A|Moton Field Munic...|32.4605722| -85.6800278| 264| -5|  A|
|06C| Schaumburg Regional|41.9893408| -88.1012428| 801| -6|  A|
|06N|     Randall Airport| 41.431912| -74.3915611| 523| -5|  A|
|09J|Jekyll Island Air...|31.0744722| -81.4277778|  11| -4|  A|
|0A9|Elizabethton Muni...|36.3712222| -82.1734167|1593| -4|  A|
|0G6|Williams County A...|41.4673056| -84.5067778| 730| -5|  A|
|0G7|Finger Lakes Regi...|42.8835647| -76.7812318| 492| -5|  A|
|0P2|Shoestring Aviati...|39.7948244| -76.6471914|1000| -5|  U|
|0S9|Jefferson County ...|48.0538086|-122.8106436| 108| -8|  A|
+---+--------------------+----------+------------+----+---+---+
only showing top 10 rows



In [9]:
# Loads file 2
df_dsa_voos = spark.read.csv("/opt/spark/data/dataset2.csv", header = True)

In [10]:
df_dsa_voos.show(10)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS|   522|   SEA| BUR|     127|     937|   7|    54|
|2014|    1| 15|    1037|        7|    1

                                                                                

In [11]:
# Loads file 3
df_dsa_aeronaves = spark.read.csv("/opt/spark/data/dataset3.csv", header = True)

In [12]:
df_dsa_aeronaves.show(10)

+-------+----+--------------------+----------------+--------+-------+-----+-----+---------+
|tailnum|year|                type|    manufacturer|   model|engines|seats|speed|   engine|
+-------+----+--------------------+----------------+--------+-------+-----+-----+---------+
| N102UW|1998|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N103US|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N104UW|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N105UW|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N107US|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N108UW|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N109UW|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N110UW|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA

We will convert this data to the following format:

- Input data --> ['month', 'air_time', 'carr_fact', 'dest_fact', 'plane_age'] assembled into the features vector.
- Output data --> ['is_late'] under the name label.

We will then use the data in this format to train and evaluate two Machine Learning models. We will pick the better model and then create the job that automates the training process on the Spark cluster.

## Data Exploration and Cleaning

In [13]:
# Creates a temporary view
df_dsa_voos.createOrReplaceTempView('voos')

If you want to run SQL queries directly against the data, creating a temporary view lets you use SQL syntax to filter, group, and manipulate it, which can be more intuitive or easier to express than the DataFrame API.

In [14]:
# Lists the tables
spark.catalog.listTables()

[Table(name='voos', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]

In [15]:
# SQL query
query = """
SELECT 
    carrier AS companhia_aerea,
    COUNT(*) AS total_voos,
    ROUND(AVG(dep_delay), 2) AS media_atraso_partida,
    ROUND(AVG(arr_delay), 2) AS media_atraso_chegada,
    MAX(dep_delay) AS maior_atraso_partida,
    MAX(arr_delay) AS maior_atraso_chegada
FROM 
    voos
WHERE 
    dep_delay > 0 OR arr_delay > 0
GROUP BY 
    carrier
ORDER BY 
    media_atraso_chegada DESC
"""

In [16]:
# Runs the SQL query and stores the result in a DataFrame
df_result = spark.sql(query)

In [17]:
# Shows the result
df_result.show()

+---------------+----------+--------------------+--------------------+--------------------+--------------------+
|companhia_aerea|total_voos|media_atraso_partida|media_atraso_chegada|maior_atraso_partida|maior_atraso_chegada|
+---------------+----------+--------------------+--------------------+--------------------+--------------------+
|             HA|        31|                14.9|               28.68|                   9|                   9|
|             VX|        63|               29.89|                28.6|                  96|                  95|
|             AA|       243|               27.61|               27.09|                  97|                  NA|
|             OO|       421|                19.8|               22.51|                  95|                  NA|
|             UA|       540|               23.71|               19.91|                  99|                  NA|
|             B6|       103|               20.21|               19.49|                  92|     

                                                                                

Creating a DataFrame directly from another DataFrame is more direct than going through an intermediate temporary view, and it avoids registering an extra object in the session catalog. If you only need simple operations or direct manipulations, working with the DataFrame itself is the simpler choice.

In [18]:
# Creates a DataFrame from the temporary view
df_voos = spark.table('voos')

In [19]:
df_voos.show(10)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS|   522|   SEA| BUR|     127|     937|   7|    54|
|2014|    1| 15|    1037|        7|    1

In [20]:
# Creates the flight duration column in hours (a feature engineering task)
df_dsa_voos = df_dsa_voos.withColumn('duration_hrs', round(df_dsa_voos.air_time / 60, 2))

In [21]:
df_dsa_voos.show(10)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|duration_hrs|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|         2.2|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|         6.0|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|        1.85|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|        1.38|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS

In [22]:
# Filter to view the longer flights (distance > 1000)
df_voos_longos_1 = df_dsa_voos.filter('distance > 1000')

In [23]:
df_voos_longos_1.show(10)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|duration_hrs|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------+
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|         6.0|
|2014|    4| 19|    1236|       -4|    1508|       -7|     AS| N309AS|   490|   SEA| SAN|     135|    1050|  12|    36|        2.25|
|2014|   11| 19|    1812|       -3|    2352|       -4|     AS| N564AS|    26|   SEA| ORD|     198|    1721|  18|    12|         3.3|
|2014|    8|  3|    1120|        0|    1415|        2|     AS| N305AS|   656|   SEA| PHX|     154|    1107|  11|    20|        2.57|
|2014|   11| 12|    2346|       -4|     217|      -28|     AS| N765AS

In [24]:
# Sorts the DataFrame by the 'duration_hrs' column in descending order
df_voos_longos_1_sorted = df_voos_longos_1.orderBy(desc('duration_hrs'))

In [25]:
# Shows the sorted result
df_voos_longos_1_sorted.show(10)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|duration_hrs|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------+
|2014|    3| 15|    1822|       -8|    2233|       58|     AS| N516AS|   815|   SEA| LIH|     409|    2701|  18|    22|        6.82|
|2014|    2| 13|     824|       -6|    1329|       34|     AS| N536AS|   861|   SEA| OGG|     403|    2640|   8|    24|        6.72|
|2014|   12| 23|     809|       -6|    1309|       40|     DL| N548US|  2210|   SEA| HNL|     399|    2677|   8|     9|        6.65|
|2014|    2| 13|     714|       -6|    1223|       41|     AS| N557AS|   855|   PDX| KOA|     396|    2607|   7|    14|         6.6|
|2014|    3|  6|    1815|       -5|    2310|       15|     AS| N513AS

In [26]:
# Same rule as before, with a different syntax
# Filters flights with distance greater than 1000 and sorts by the 'duration_hrs' column in descending order
df_voos_longos_2 = df_dsa_voos.filter(df_dsa_voos.distance > 1000).orderBy(desc('duration_hrs'))

In [27]:
df_voos_longos_2.show(10)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|duration_hrs|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------+
|2014|    3| 15|    1822|       -8|    2233|       58|     AS| N516AS|   815|   SEA| LIH|     409|    2701|  18|    22|        6.82|
|2014|    2| 13|     824|       -6|    1329|       34|     AS| N536AS|   861|   SEA| OGG|     403|    2640|   8|    24|        6.72|
|2014|   12| 23|     809|       -6|    1309|       40|     DL| N548US|  2210|   SEA| HNL|     399|    2677|   8|     9|        6.65|
|2014|    2| 13|     714|       -6|    1223|       41|     AS| N557AS|   855|   PDX| KOA|     396|    2607|   7|    14|         6.6|
|2014|    3|  6|    1815|       -5|    2310|       15|     AS| N513AS

                                                                                

In [28]:
# Selecting 3 columns
selected_1 = df_dsa_voos.select('tailnum', 'origin', 'dest')

In [29]:
# Selecting 3 columns with a different syntax
temp = df_dsa_voos.select(df_dsa_voos.origin, df_dsa_voos.dest, df_dsa_voos.carrier)

In [30]:
# Creating 2 filters
FilterA = df_dsa_voos.origin == 'SEA'
FilterB = df_dsa_voos.dest == 'PDX'

In [31]:
# Applying filter() with the filters created above
selected_2 = temp.filter(FilterA).filter(FilterB)

In [32]:
selected_2.show()

+------+----+-------+
|origin|dest|carrier|
+------+----+-------+
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
|   SEA| PDX|     AS|
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
|   SEA| PDX|     AS|
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
|   SEA| PDX|     OO|
+------+----+-------+
only showing top 20 rows



In [33]:
# Defining the average flight speed as a column expression
avg_speed = (df_dsa_voos.distance / (df_dsa_voos.air_time / 60)).alias("avg_speed")

In [34]:
df_dsa_voos.show(5)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|duration_hrs|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|         2.2|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|         6.0|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|        1.85|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|        1.38|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS

In [35]:
# Adding the new variable to the select
speed_1 = df_dsa_voos.select('origin', 'dest', 'tailnum', avg_speed)

In [36]:
speed_1.show()

+------+----+-------+------------------+
|origin|dest|tailnum|         avg_speed|
+------+----+-------+------------------+
|   SEA| LAX| N846VA| 433.6363636363636|
|   SEA| HNL| N559AS| 446.1666666666667|
|   SEA| SFO| N847VA|367.02702702702703|
|   PDX| SJC| N360SW| 411.3253012048193|
|   SEA| BUR| N612AS| 442.6771653543307|
|   PDX| DEN| N646SW|491.40495867768595|
|   PDX| OAK| N422WN|             362.0|
|   SEA| SFO| N361VA| 415.7142857142857|
|   SEA| SAN| N309AS| 466.6666666666667|
|   SEA| ORD| N564AS| 521.5151515151515|
|   SEA| LAX| N323AS| 440.3076923076923|
|   SEA| PHX| N305AS|431.29870129870125|
|   SEA| LAS| N433AS| 409.6062992125984|
|   SEA| ANC| N765AS|474.75409836065575|
|   SEA| SFO| N713AS| 315.8139534883721|
|   PDX| SFO| N27205| 366.6666666666667|
|   SEA| SMF| N626AS|477.63157894736844|
|   SEA| MDW| N8634A|481.38888888888886|
|   SEA| BOS| N597AS| 516.4137931034483|
|   PDX| BUR| N215AG| 441.6216216216216|
+------+----+-------+------------------+
only showing top

In [37]:
# Doing the calculation directly in the select
speed_2 = df_dsa_voos.selectExpr('origin', 'dest', 'tailnum', 'round(distance/(air_time/60), 2) as avg_speed')

In [38]:
speed_2.show()

+------+----+-------+---------+
|origin|dest|tailnum|avg_speed|
+------+----+-------+---------+
|   SEA| LAX| N846VA|   433.64|
|   SEA| HNL| N559AS|   446.17|
|   SEA| SFO| N847VA|   367.03|
|   PDX| SJC| N360SW|   411.33|
|   SEA| BUR| N612AS|   442.68|
|   PDX| DEN| N646SW|    491.4|
|   PDX| OAK| N422WN|    362.0|
|   SEA| SFO| N361VA|   415.71|
|   SEA| SAN| N309AS|   466.67|
|   SEA| ORD| N564AS|   521.52|
|   SEA| LAX| N323AS|   440.31|
|   SEA| PHX| N305AS|    431.3|
|   SEA| LAS| N433AS|   409.61|
|   SEA| ANC| N765AS|   474.75|
|   SEA| SFO| N713AS|   315.81|
|   PDX| SFO| N27205|   366.67|
|   SEA| SMF| N626AS|   477.63|
|   SEA| MDW| N8634A|   481.39|
|   SEA| BOS| N597AS|   516.41|
|   PDX| BUR| N215AG|   441.62|
+------+----+-------+---------+
only showing top 20 rows
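
The speed calculation can be checked by hand for the first row of the flights table (SEA to LAX). This sketch assumes `air_time` is in minutes, as the division by 60 suggests, and that `distance` is in miles, which this section does not state.

```python
# Values from the first flight row: SEA -> LAX
distance = 954    # 'distance' column (unit assumed to be miles)
air_time = 132    # 'air_time' column (assumed to be minutes)

duration_hrs = air_time / 60          # minutes to hours
avg_speed = distance / duration_hrs   # distance units per hour

# f-strings are used instead of round(), because the notebook imports
# pyspark.sql.functions.round, which shadows the built-in
print(f"{duration_hrs:.2f} h, {avg_speed:.2f} per hour")
```

The result agrees with the first `avg_speed` value shown in the tables above (433.64 after rounding to two decimals).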



In [39]:
# Summary of 2 variables
df_dsa_voos.describe('air_time', 'distance').show()

+-------+------------------+-----------------+
|summary|          air_time|         distance|
+-------+------------------+-----------------+
|  count|             10000|            10000|
|   mean|152.88423173803525|        1208.1516|
| stddev|  72.8656286392139|656.8599023464376|
|    min|               100|             1009|
|    max|                NA|              991|
+-------+------------------+-----------------+
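
The `min` and `max` rows above look odd because the CSV was read without schema inference, so every column is a string and values are compared character by character. A plain-Python sketch of the same effect, using a few values that appear in this notebook:

```python
# Values as they exist before any cast: plain strings
air_time_strings = ["132", "360", "20", "409", "100", "NA"]

# String comparison is lexicographic (character by character)
string_min = min(air_time_strings)  # '100' sorts before '20' because '1' < '2'
string_max = max(air_time_strings)  # 'NA' sorts after every digit string
print(string_min, string_max)

# After converting to numbers (and skipping the 'NA' marker) the order is numeric
numeric = [float(v) for v in air_time_strings if v != "NA"]
print(min(numeric), max(numeric))
```

This is why the summary only becomes meaningful once the columns are cast to a numeric type.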



                                                                                

In [40]:
# Shows the data type of each column
df_dsa_voos.dtypes

[('year', 'string'),
 ('month', 'string'),
 ('day', 'string'),
 ('dep_time', 'string'),
 ('dep_delay', 'string'),
 ('arr_time', 'string'),
 ('arr_delay', 'string'),
 ('carrier', 'string'),
 ('tailnum', 'string'),
 ('flight', 'string'),
 ('origin', 'string'),
 ('dest', 'string'),
 ('air_time', 'string'),
 ('distance', 'string'),
 ('hour', 'string'),
 ('minute', 'string'),
 ('duration_hrs', 'double')]

In [41]:
# Adjusting the data type of two columns
df_dsa_voos = df_dsa_voos.withColumn('distance', df_dsa_voos.distance.cast('float'))
df_dsa_voos = df_dsa_voos.withColumn('air_time', df_dsa_voos.air_time.cast('float'))

In [42]:
# Shows the data type of each column
df_dsa_voos.dtypes

[('year', 'string'),
 ('month', 'string'),
 ('day', 'string'),
 ('dep_time', 'string'),
 ('dep_delay', 'string'),
 ('arr_time', 'string'),
 ('arr_delay', 'string'),
 ('carrier', 'string'),
 ('tailnum', 'string'),
 ('flight', 'string'),
 ('origin', 'string'),
 ('dest', 'string'),
 ('air_time', 'float'),
 ('distance', 'float'),
 ('hour', 'string'),
 ('minute', 'string'),
 ('duration_hrs', 'double')]

In [43]:
# Summary of 2 variables
df_dsa_voos.describe('air_time', 'distance').show()

+-------+------------------+-----------------+
|summary|          air_time|         distance|
+-------+------------------+-----------------+
|  count|              9925|            10000|
|   mean|152.88423173803525|        1208.1516|
| stddev|  72.8656286392139|656.8599023464376|
|    min|              20.0|             93.0|
|    max|             409.0|           2724.0|
+-------+------------------+-----------------+



In [44]:
# Grouping by aircraft
by_plane = df_dsa_voos.groupBy('tailnum')

In [45]:
# Count
by_plane.count().show()

+-------+-----+
|tailnum|count|
+-------+-----+
| N442AS|   38|
| N102UW|    2|
| N36472|    4|
| N38451|    4|
| N73283|    4|
| N513UA|    2|
| N954WN|    5|
| N388DA|    3|
| N567AA|    1|
| N516UA|    2|
| N927DN|    1|
| N8322X|    1|
| N466SW|    1|
|  N6700|    1|
| N607AS|   45|
| N622SW|    4|
| N584AS|   31|
| N914WN|    4|
| N654AW|    2|
| N336NW|    1|
+-------+-----+
only showing top 20 rows



                                                                                

In [46]:
# Grouping by flight origin
by_origin = df_dsa_voos.groupBy('origin')

In [47]:
# Average air time by flight origin
by_origin.avg('air_time').show()

+------+------------------+
|origin|     avg(air_time)|
+------+------------------+
|   SEA| 160.4361496051259|
|   PDX|137.11543248288737|
+------+------------------+



In [48]:
# Summary
df_dsa_voos.describe('dep_delay').show()

+-------+------------------+
|summary|         dep_delay|
+-------+------------------+
|  count|             10000|
|   mean| 6.068629421221865|
| stddev|28.808608062751805|
|    min|                -1|
|    max|                NA|
+-------+------------------+



In [49]:
# Adjusting the data type
df_dsa_voos = df_dsa_voos.withColumn('dep_delay', df_dsa_voos.dep_delay.cast('float'))

In [50]:
# Summary
df_dsa_voos.describe('dep_delay').show()

+-------+------------------+
|summary|         dep_delay|
+-------+------------------+
|  count|              9952|
|   mean| 6.068629421221865|
| stddev|28.808608062751805|
|    min|             -19.0|
|    max|             886.0|
+-------+------------------+



In [51]:
# Grouping by month and flight destination
by_month_dest = df_dsa_voos.groupBy('month', 'dest')

In [52]:
# Calculating the mean
by_month_dest.avg('dep_delay').show()

+-----+----+--------------------+
|month|dest|      avg(dep_delay)|
+-----+----+--------------------+
|   11| TUS| -2.3333333333333335|
|   11| ANC|   7.529411764705882|
|    1| BUR|               -1.45|
|    1| PDX| -5.6923076923076925|
|    6| SBA|                -2.5|
|    5| LAX|-0.15789473684210525|
|   10| DTW|                 2.6|
|    6| SIT|                -1.0|
|   10| DFW|  18.176470588235293|
|    3| FAI|                -2.2|
|   10| SEA|                -0.8|
|    2| TUS| -0.6666666666666666|
|   12| OGG|  25.181818181818183|
|    9| DFW|   4.066666666666666|
|    5| EWR|               14.25|
|    3| RDM|                -6.2|
|    8| DCA|                 2.6|
|    7| ATL|   4.675675675675675|
|    4| JFK| 0.07142857142857142|
|   10| SNA| -1.1333333333333333|
+-----+----+--------------------+
only showing top 20 rows



In [53]:
df_dsa_aeroportos.show()

+---+--------------------+----------------+-----------------+----+---+---+
|faa|                name|             lat|              lon| alt| tz|dst|
+---+--------------------+----------------+-----------------+----+---+---+
|04G|   Lansdowne Airport|      41.1304722|      -80.6195833|1044| -5|  A|
|06A|Moton Field Munic...|      32.4605722|      -85.6800278| 264| -5|  A|
|06C| Schaumburg Regional|      41.9893408|      -88.1012428| 801| -6|  A|
|06N|     Randall Airport|       41.431912|      -74.3915611| 523| -5|  A|
|09J|Jekyll Island Air...|      31.0744722|      -81.4277778|  11| -4|  A|
|0A9|Elizabethton Muni...|      36.3712222|      -82.1734167|1593| -4|  A|
|0G6|Williams County A...|      41.4673056|      -84.5067778| 730| -5|  A|
|0G7|Finger Lakes Regi...|      42.8835647|      -76.7812318| 492| -5|  A|
|0P2|Shoestring Aviati...|      39.7948244|      -76.6471914|1000| -5|  U|
|0S9|Jefferson County ...|      48.0538086|     -122.8106436| 108| -8|  A|
|0W3|Harford County Ai...

In [54]:
# Renames the column
df_dsa_aeroportos = df_dsa_aeroportos.withColumnRenamed('faa', 'dest')

In [55]:
df_dsa_aeronaves.show(5)

+-------+----+--------------------+----------------+--------+-------+-----+-----+---------+
|tailnum|year|                type|    manufacturer|   model|engines|seats|speed|   engine|
+-------+----+--------------------+----------------+--------+-------+-----+-----+---------+
| N102UW|1998|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N103US|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N104UW|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N105UW|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N107US|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
+-------+----+--------------------+----------------+--------+-------+-----+-----+---------+
only showing top 5 rows



In [56]:
# Renames the column
df_dsa_aeronaves = df_dsa_aeronaves.withColumnRenamed('year', 'plane_year')

## Joining the Datasets and Preparing the Final Dataset

In [57]:
df_dsa_aeroportos.show(3)

+----+--------------------+----------+-----------+----+---+---+
|dest|                name|       lat|        lon| alt| tz|dst|
+----+--------------------+----------+-----------+----+---+---+
| 04G|   Lansdowne Airport|41.1304722|-80.6195833|1044| -5|  A|
| 06A|Moton Field Munic...|32.4605722|-85.6800278| 264| -5|  A|
| 06C| Schaumburg Regional|41.9893408|-88.1012428| 801| -6|  A|
+----+--------------------+----------+-----------+----+---+---+
only showing top 3 rows



In [58]:
df_dsa_voos.show(3)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|duration_hrs|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------+
|2014|   12|  8|     658|     -7.0|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|   132.0|   954.0|   6|    58|         2.2|
|2014|    1| 22|    1040|      5.0|    1505|        5|     AS| N559AS|   851|   SEA| HNL|   360.0|  2677.0|  10|    40|         6.0|
|2014|    3|  9|    1443|     -2.0|    1652|        2|     VX| N847VA|   755|   SEA| SFO|   111.0|   679.0|  14|    43|        1.85|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------+
only showing top 3 rows



In [59]:
df_dsa_aeronaves.show(3)

+-------+----------+--------------------+----------------+--------+-------+-----+-----+---------+
|tailnum|plane_year|                type|    manufacturer|   model|engines|seats|speed|   engine|
+-------+----------+--------------------+----------------+--------+-------+-----+-----+---------+
| N102UW|      1998|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N103US|      1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N104UW|      1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
+-------+----------+--------------------+----------------+--------+-------+-----+-----+---------+
only showing top 3 rows



In [60]:
# Joins 2 datasets
df_dsa_voos_aeroportos = df_dsa_voos.join(df_dsa_aeroportos, on='dest', how='leftouter')

**on='dest':** This parameter specifies that the join is performed on the dest (destination) column, which is present in both DataFrames. Here it matches the flight data (df_dsa_voos) with the airport data (df_dsa_aeroportos) through dest, the destination airport of each flight.

**how='leftouter':** Specifies the join type. 'leftouter' (LEFT OUTER JOIN) keeps every record from the left DataFrame (df_dsa_voos) and adds the matching records from the right DataFrame (df_dsa_aeroportos).
If there is no match in the right DataFrame (df_dsa_aeroportos), the columns coming from it are null.

In [61]:
df_dsa_voos_aeroportos.show(5)

+----+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+--------+--------+----+------+------------+--------------------+---------+-----------+---+---+---+
|dest|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|air_time|distance|hour|minute|duration_hrs|                name|      lat|        lon|alt| tz|dst|
+----+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+--------+--------+----+------+------------+--------------------+---------+-----------+---+---+---+
| LAX|2014|   12|  8|     658|     -7.0|     935|       -5|     VX| N846VA|  1780|   SEA|   132.0|   954.0|   6|    58|         2.2|    Los Angeles Intl|33.942536|-118.408075|126| -8|  A|
| HNL|2014|    1| 22|    1040|      5.0|    1505|        5|     AS| N559AS|   851|   SEA|   360.0|  2677.0|  10|    40|         6.0|       Honolulu Intl|21.318681|-157.922428| 13|-10|  N|
| SFO|2014|    3|  9|    1443|     -2.0|    1652|        2| 

In [62]:
# Joins 2 datasets
df_dsa_final = df_dsa_voos_aeroportos.join(df_dsa_aeronaves, on='tailnum', how='leftouter')

In [63]:
df_dsa_final.show(10)

+-------+----+----+-----+---+--------+---------+--------+---------+-------+------+------+--------+--------+----+------+------------+--------------------+---------+-----------+----+---+---+----------+--------------------+------------+--------+-------+-----+-----+---------+
|tailnum|dest|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|flight|origin|air_time|distance|hour|minute|duration_hrs|                name|      lat|        lon| alt| tz|dst|plane_year|                type|manufacturer|   model|engines|seats|speed|   engine|
+-------+----+----+-----+---+--------+---------+--------+---------+-------+------+------+--------+--------+----+------+------------+--------------------+---------+-----------+----+---+---+----------+--------------------+------------+--------+-------+-----+-----+---------+
| N846VA| LAX|2014|   12|  8|     658|     -7.0|     935|       -5|     VX|  1780|   SEA|   132.0|   954.0|   6|    58|         2.2|    Los Angeles Intl|33.942536|-118.408075| 126| 

In [64]:
df_dsa_final.dtypes

[('tailnum', 'string'),
 ('dest', 'string'),
 ('year', 'string'),
 ('month', 'string'),
 ('day', 'string'),
 ('dep_time', 'string'),
 ('dep_delay', 'float'),
 ('arr_time', 'string'),
 ('arr_delay', 'string'),
 ('carrier', 'string'),
 ('flight', 'string'),
 ('origin', 'string'),
 ('air_time', 'float'),
 ('distance', 'float'),
 ('hour', 'string'),
 ('minute', 'string'),
 ('duration_hrs', 'double'),
 ('name', 'string'),
 ('lat', 'string'),
 ('lon', 'string'),
 ('alt', 'string'),
 ('tz', 'string'),
 ('dst', 'string'),
 ('plane_year', 'string'),
 ('type', 'string'),
 ('manufacturer', 'string'),
 ('model', 'string'),
 ('engines', 'string'),
 ('seats', 'string'),
 ('speed', 'string'),
 ('engine', 'string')]

In [65]:
# Casts the columns to the appropriate data types
df_dsa_final = df_dsa_final.withColumn('month', df_dsa_final.month.cast('integer'))
df_dsa_final = df_dsa_final.withColumn('air_time' , df_dsa_final.air_time.cast('integer'))
df_dsa_final = df_dsa_final.withColumn('arr_delay', df_dsa_final.arr_delay.cast('integer'))
df_dsa_final = df_dsa_final.withColumn('plane_year', df_dsa_final.plane_year.cast('integer'))

## Feature Engineering

In [66]:
df_dsa_final.describe('month', 'air_time', 'arr_delay', 'plane_year').show()

+-------+------------------+------------------+------------------+-----------------+
|summary|             month|          air_time|         arr_delay|       plane_year|
+-------+------------------+------------------+------------------+-----------------+
|  count|             10000|              9925|              9925|             9354|
|   mean|            6.6438|152.88423173803525|2.2530982367758186|2001.594398118452|
| stddev|3.3191600205962097|  72.8656286392139|31.074918600451877|58.92921992728455|
|    min|                 1|                20|               -58|                0|
|    max|                12|               409|               900|             2014|
+-------+------------------+------------------+------------------+-----------------+



In [67]:
# Creates a variable with the age of the aircraft
df_dsa_final = df_dsa_final.withColumn('plane_age', df_dsa_final.year - df_dsa_final.plane_year)

In [68]:
df_dsa_final.select('month', 'air_time', 'arr_delay', 'plane_age').show(10)

+-----+--------+---------+---------+
|month|air_time|arr_delay|plane_age|
+-----+--------+---------+---------+
|   12|     132|       -5|      3.0|
|    1|     360|        5|      8.0|
|    3|     111|        2|      3.0|
|    4|      83|       34|     22.0|
|    3|     127|        1|     15.0|
|    1|     121|        2|     17.0|
|    7|      90|       51|     12.0|
|    5|      98|      -18|      1.0|
|    4|     135|       -7|     13.0|
|   11|     198|       -4|      8.0|
+-----+--------+---------+---------+
only showing top 10 rows



In [69]:
# Creates the boolean variable "is_late", which is true when the arrival delay is greater than zero
df_dsa_final = df_dsa_final.withColumn('is_late', df_dsa_final.arr_delay > 0)

In [70]:
df_dsa_final.select('month', 'air_time', 'arr_delay', 'plane_age', 'is_late').show(10)

+-----+--------+---------+---------+-------+
|month|air_time|arr_delay|plane_age|is_late|
+-----+--------+---------+---------+-------+
|   12|     132|       -5|      3.0|  false|
|    1|     360|        5|      8.0|   true|
|    3|     111|        2|      3.0|   true|
|    4|      83|       34|     22.0|   true|
|    3|     127|        1|     15.0|   true|
|    1|     121|        2|     17.0|   true|
|    7|      90|       51|     12.0|   true|
|    5|      98|      -18|      1.0|  false|
|    4|     135|       -7|     13.0|  false|
|   11|     198|       -4|      8.0|  false|
+-----+--------+---------+---------+-------+
only showing top 10 rows



In [71]:
# The target variable (label) will be "is_late", that is, whether the flight is delayed or not
# The column is named "label" because that is the default label column name (labelCol) that Spark ML estimators look for
df_dsa_final = df_dsa_final.withColumn('label', df_dsa_final.is_late.cast('integer'))

In [72]:
df_dsa_final.select('month', 'air_time', 'arr_delay', 'plane_age', 'is_late', 'label').show(10)

+-----+--------+---------+---------+-------+-----+
|month|air_time|arr_delay|plane_age|is_late|label|
+-----+--------+---------+---------+-------+-----+
|   12|     132|       -5|      3.0|  false|    0|
|    1|     360|        5|      8.0|   true|    1|
|    3|     111|        2|      3.0|   true|    1|
|    4|      83|       34|     22.0|   true|    1|
|    3|     127|        1|     15.0|   true|    1|
|    1|     121|        2|     17.0|   true|    1|
|    7|      90|       51|     12.0|   true|    1|
|    5|      98|      -18|      1.0|  false|    0|
|    4|     135|       -7|     13.0|  false|    0|
|   11|     198|       -4|      8.0|  false|    0|
+-----+--------+---------+---------+-------+-----+
only showing top 10 rows



## Preprocessing with StringIndexer and OneHotEncoder

In [73]:
df_dsa_final.select('carrier', 'dest').show(10)

+-------+----+
|carrier|dest|
+-------+----+
|     VX| LAX|
|     AS| HNL|
|     VX| SFO|
|     WN| SJC|
|     AS| BUR|
|     WN| DEN|
|     WN| OAK|
|     VX| SFO|
|     AS| SAN|
|     AS| ORD|
+-------+----+
only showing top 10 rows



In [74]:
# Creates the StringIndexer indexers
carr_indexer = StringIndexer(inputCol='carrier', outputCol='carrier_index')
dest_indexer = StringIndexer(inputCol='dest', outputCol='dest_index')

In [75]:
# Creates the OneHotEncoder encoders
carr_encoder = OneHotEncoder(inputCol='carrier_index', outputCol='carr_fact')
dest_encoder = OneHotEncoder(inputCol='dest_index', outputCol='dest_fact')

In [76]:
# Creates the vector assembler, skipping any row with invalid (null or NaN) input values
# The input variables are assembled into a vector column named "features", the default features column name (featuresCol) of Spark ML estimators
vec_assembler = VectorAssembler(inputCols = ['month', 'air_time', 'carr_fact', 'dest_fact', 'plane_age'],
                                outputCol = 'features',
                                handleInvalid = "skip")

- Input data --> ['month', 'air_time', 'carr_fact', 'dest_fact', 'plane_age'] as the features vector.
- Output data --> ['is_late'] under the name label.

## Creating the Preprocessing Pipeline

In [77]:
# Creates the transformation and preprocessing pipeline
dsa_pipe = Pipeline(stages = [dest_indexer, dest_encoder, carr_indexer, carr_encoder, vec_assembler])

In [78]:
# Fits and applies the pipeline
piped_data = dsa_pipe.fit(df_dsa_final).transform(df_dsa_final)

                                                                                

In [79]:
piped_data.show(5)

+-------+----+----+-----+---+--------+---------+--------+---------+-------+------+------+--------+--------+----+------+------------+--------------------+---------+-----------+---+---+---+----------+--------------------+------------+--------+-------+-----+-----+---------+---------+-------+-----+----------+---------------+-------------+--------------+--------------------+
|tailnum|dest|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|flight|origin|air_time|distance|hour|minute|duration_hrs|                name|      lat|        lon|alt| tz|dst|plane_year|                type|manufacturer|   model|engines|seats|speed|   engine|plane_age|is_late|label|dest_index|      dest_fact|carrier_index|     carr_fact|            features|
+-------+----+----+-----+---+--------+---------+--------+---------+-------+------+------+--------+--------+----+------+------------+--------------------+---------+-----------+---+---+---+----------+--------------------+------------+--------+-------+-----

                                                                                

In [80]:
# Input and output data
piped_data.select("features", "label").show(truncate = False)

+--------------------------------------------+-----+
|features                                    |label|
+--------------------------------------------+-----+
|(81,[0,1,10,13,80],[12.0,132.0,1.0,1.0,3.0])|0    |
|(81,[0,1,2,34,80],[1.0,360.0,1.0,1.0,8.0])  |1    |
|(81,[0,1,10,12,80],[3.0,111.0,1.0,1.0,3.0]) |1    |
|(81,[0,1,3,21,80],[4.0,83.0,1.0,1.0,22.0])  |1    |
|(81,[0,1,2,35,80],[3.0,127.0,1.0,1.0,15.0]) |1    |
|(81,[0,1,3,14,80],[1.0,121.0,1.0,1.0,17.0]) |1    |
|(81,[0,1,3,22,80],[7.0,90.0,1.0,1.0,12.0])  |1    |
|(81,[0,1,10,12,80],[5.0,98.0,1.0,1.0,1.0])  |0    |
|(81,[0,1,2,24,80],[4.0,135.0,1.0,1.0,13.0]) |0    |
|(81,[0,1,2,18,80],[11.0,198.0,1.0,1.0,8.0]) |0    |
|(81,[0,1,2,13,80],[11.0,130.0,1.0,1.0,10.0])|0    |
|(81,[0,1,2,15,80],[8.0,154.0,1.0,1.0,13.0]) |1    |
|(81,[0,1,2,16,80],[10.0,127.0,1.0,1.0,1.0]) |1    |
|(81,[0,1,2,17,80],[11.0,183.0,1.0,1.0,22.0])|0    |
|(81,[0,1,2,12,80],[10.0,129.0,1.0,1.0,15.0])|1    |
|(81,[0,1,6,12,80],[1.0,90.0,1.0,1.0,14.0])  |
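Each `features` value above is printed in sparse form: `(size, [indices], [values])`. Following the `inputCols` order given to the assembler, index 0 holds `month`, index 1 holds `air_time`, the last index (80) holds `plane_age`, and the indices in between are the one-hot slots for carrier and destination. The plain-Python sketch below expands the first row of the output; the variable names are illustrative.

```python
# Sparse triple copied from the first row of the output: (size, indices, values)
size, indices, values = 81, [0, 1, 10, 13, 80], [12.0, 132.0, 1.0, 1.0, 3.0]

# Expand into a dense list: every position that is not listed is zero
dense = [0.0] * size
for i, v in zip(indices, values):
    dense[i] = v

# Numeric inputs sit at the two first positions and at the last one
month, air_time, plane_age = dense[0], dense[1], dense[-1]

# The remaining active positions are the one-hot slots (carrier, then destination)
one_hot_positions = [i for i in indices if 1 < i < size - 1]
```

The decoded values (month 12, air time 132, plane age 3) match the first row shown earlier for these columns.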

                                                                                

In [81]:
# Splits the data into training and test sets in a 70/30 proportion
dados_treino, dados_teste = piped_data.randomSplit([.7, .3])

## Adjusting the Number of Partitions

If the data partitions are too large, Spark may struggle to distribute them efficiently across the executors. Adjusting the number of partitions can help with processing. You can use the `repartition` method to split the data into more or fewer partitions; note that it triggers a full shuffle of the data.

In [82]:
# Choose a number of partitions suited to the size of your cluster and your data
dados_treino = dados_treino.repartition(10)  

## AI Model Training Pipeline with PySpark in a Distributed Environment

## Model Version 1

In [83]:
%%time

# Initializes the RandomForest model
modelo_dsa_rf = RandomForestClassifier()

# Evaluator for the "areaUnderROC" metric
evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderROC')

# Creates the parameter grid builder
grid = ParamGridBuilder()

# Adds the hyperparameters to the grid
grid = grid.addGrid(modelo_dsa_rf.numTrees, [10, 50, 100])
grid = grid.addGrid(modelo_dsa_rf.maxDepth, [5, 10, 20])

# Builds the grid
grid = grid.build()

# Creates the CrossValidator
cv = CrossValidator(estimator = modelo_dsa_rf,
                    estimatorParamMaps = grid,
                    evaluator = evaluator)

# Trains the models with cross-validation
modelos = cv.fit(dados_treino)

# Extracts the best model
best_rf = modelos.bestModel

# Uses the model to score the test set
test_results_rf = best_rf.transform(dados_teste)

# Evaluates the predictions
print(evaluator.evaluate(test_results_rf))

0.6342168404624947
CPU times: user 3.04 s, sys: 1.83 s, total: 4.87 s
Wall time: 8min 26s
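Much of the wall time comes from the number of models being trained. The grid has 3 values of `numTrees` and 3 of `maxDepth`, and `CrossValidator` uses 3 folds when `numFolds` is not set, so every combination is fitted three times before the best one is refitted on the full training set. The plain-Python sketch below just reproduces that arithmetic.

```python
from itertools import product

# Hyperparameter values used in the random forest grid
num_trees = [10, 50, 100]
max_depth = [5, 10, 20]
num_folds = 3  # CrossValidator default when numFolds is not set

# Every combination is trained once per fold
combos = list(product(num_trees, max_depth))
cv_fits = len(combos) * num_folds

# Plus one final refit of the best combination on all the training data
total_fits = cv_fits + 1
print(len(combos), cv_fits, total_fits)
```

`CrossValidator` also accepts a `parallelism` argument to evaluate several combinations concurrently, which may shorten the wall time when the cluster has spare capacity.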


                                                                                

## Model Version 2

In [84]:
%%time

# Initializes the Logistic Regression model
modelo_dsa_rl = LogisticRegression()

# Evaluator for the "areaUnderROC" metric
evaluator = evals.BinaryClassificationEvaluator(metricName = 'areaUnderROC')

# Creates the parameter grid builder
grid = tune.ParamGridBuilder()

# Adds the hyperparameters to the grid
grid = grid.addGrid(modelo_dsa_rl.regParam, np.arange(0, .1, .01))
grid = grid.addGrid(modelo_dsa_rl.elasticNetParam, [0,1])

# Builds the grid
grid = grid.build()

# Creates the CrossValidator
cv = tune.CrossValidator(estimator = modelo_dsa_rl,
                         estimatorParamMaps = grid,
                         evaluator = evaluator)

# Trains the models with cross-validation
modelos = cv.fit(dados_treino)

# Extracts the best model
best_lr = modelos.bestModel

# Uses the model to score the test set
test_results_rl = best_lr.transform(dados_teste)

# Evaluates the predictions
print(evaluator.evaluate(test_results_rl))

                                                                                

0.6953456784942735
CPU times: user 3.22 s, sys: 2.63 s, total: 5.85 s
Wall time: 5min 8s


## Saving Data and Model in Parquet Format

In [85]:
# Saves the training DataFrame in Parquet format
dados_treino.write.mode('overwrite').parquet('/opt/spark/data/dados_treino.parquet')

                                                                                

In [86]:
# Saves the test DataFrame in Parquet format
dados_teste.write.mode('overwrite').parquet('/opt/spark/data/dados_teste.parquet')

                                                                                

In [87]:
# Saves the best model to disk (the logistic regression, which scored the higher AUC on the test set)
best_lr.write().overwrite().save('/opt/spark/data/dsa_melhor_modelo_lr')

                                                                                

In [88]:
# Checks that the data was saved to HDFS
!hdfs dfs -ls /opt/spark/data/ | awk '{print $1, $2, $3, $4, $8}'

Found 6 items  
drwxr-xr-x - root supergroup /opt/spark/data/dados_teste.parquet
drwxr-xr-x - root supergroup /opt/spark/data/dados_treino.parquet
-rw-r--r-- 2 root supergroup /opt/spark/data/dataset1.csv
-rw-r--r-- 2 root supergroup /opt/spark/data/dataset2.csv
-rw-r--r-- 2 root supergroup /opt/spark/data/dataset3.csv
drwxr-xr-x - root supergroup /opt/spark/data/dsa_melhor_modelo_lr


In [89]:
%reload_ext watermark
%watermark -a "Data Science Academy"

Author: Data Science Academy



# End