# PySpark - Window Functions

<b>Window Ranking Functions</b>

<ul>
  <li>Window Function 1 - Row number - row_number()</li>
  <li>Window Function 2 - Ranking 1 - rank()</li>
  <li>Window Function 3 - Ranking 2 - dense_rank()</li>
  <li>Window Function 4 - Percent rank - percent_rank()</li>
  <li>Window Function 5 - Split into "N" buckets - ntile()</li>
</ul>

<b>Window Analytic Functions</b>

<ul>
  <li>Window Function 6 - LAG / previous row - lag()</li>
  <li>Window Function 7 - LEAD / next row - lead()</li>
  <li>Aggregations</li>
  <li>GroupBy + AGG 1</li>
  <li>Where</li>
  <li>Describe</li>
  <li>Window Function 8 - Aggregate functions over a window</li>
</ul>

#### Importing libraries / functions

In [None]:
!pip install pyspark

In [38]:
import os
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

# Point the workers and the driver at the same Python interpreter to avoid version mismatches
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

#### Create / start the PySpark session

In [13]:
spark = (
    SparkSession.builder
    .master('local')
    .appName('PySpark_02')
    .getOrCreate()
)

#### Create the DataFrame / read the file

In [28]:
df = spark.read.csv('/content/sample_data/wc2018-players.csv', header=True, inferSchema=True)

In [29]:
df.show(5)

+---------+---+----+------------------+----------+----------+--------------------+------+------+
|     Team|  #|Pos.| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|
+---------+---+----+------------------+----------+----------+--------------------+------+------+
|Argentina|  3|  DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|
|Argentina| 22|  MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|
|Argentina| 15|  MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|
|Argentina| 18|  DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|
|Argentina| 10|  FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)|   170|    72|
+---------+---+----+------------------+----------+----------+--------------------+------+------+
only showing top 5 rows



#### Renaming and transforming columns

In [30]:
df = df.withColumnRenamed('Team', 'Selecao').withColumnRenamed('#', 'Numero').withColumnRenamed('Pos.', 'Posicao')\
.withColumnRenamed('FIFA Popular Name', 'Nome_FIFA').withColumnRenamed('Birth Date', 'Nascimento')\
.withColumnRenamed('Shirt Name', 'Nome Camiseta').withColumnRenamed('Club', 'Time').withColumnRenamed('Height', 'Altura')\
.withColumnRenamed('Weight', 'Peso')

In [31]:
dia = udf(lambda data: data.split('.')[0])
mes = udf(lambda data: data.split('.')[1])
ano = udf(lambda data: data.split('.')[2])

In [32]:
df = df.withColumn('Dia', dia('Nascimento')).withColumn('Mes', mes('Nascimento')).withColumn('Ano', ano('Nascimento'))
df = df.withColumn('Data_Nascimento', concat_ws('-', 'Ano', 'Mes', 'Dia').cast(DateType()))
df.show(5)

+---------+------+-------+------------------+----------+-------------+--------------------+------+----+---+---+----+---------------+
|  Selecao|Numero|Posicao|         Nome_FIFA|Nascimento|Nome Camiseta|                Time|Altura|Peso|Dia|Mes| Ano|Data_Nascimento|
+---------+------+-------+------------------+----------+-------------+--------------------+------+----+---+---+----+---------------+
|Argentina|     3|     DF|TAGLIAFICO Nicolas|31.08.1992|   TAGLIAFICO|      AFC Ajax (NED)|   169|  65| 31| 08|1992|     1992-08-31|
|Argentina|    22|     MF|    PAVON Cristian|21.01.1996|        PAVÓN|CA Boca Juniors (...|   169|  65| 21| 01|1996|     1996-01-21|
|Argentina|    15|     MF|    LANZINI Manuel|15.02.1993|      LANZINI|West Ham United F...|   167|  66| 15| 02|1993|     1993-02-15|
|Argentina|    18|     DF|    SALVIO Eduardo|13.07.1990|       SALVIO|    SL Benfica (POR)|   167|  69| 13| 07|1990|     1990-07-13|
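|Argentina|    10|     FW|      MESSI Lionel|24.06.1987|        MESSI|  FC Barcelona (ESP)|   170|  72| 24| 06|1987|     1987-06-24|
+---------+------+-------+------------------+----------+-------------+--------------------+------+----+---+---+----+---------------+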

In [33]:
df.printSchema()

root
 |-- Selecao: string (nullable = true)
 |-- Numero: integer (nullable = true)
 |-- Posicao: string (nullable = true)
 |-- Nome_FIFA: string (nullable = true)
 |-- Nascimento: string (nullable = true)
 |-- Nome Camiseta: string (nullable = true)
 |-- Time: string (nullable = true)
 |-- Altura: integer (nullable = true)
 |-- Peso: integer (nullable = true)
 |-- Dia: string (nullable = true)
 |-- Mes: string (nullable = true)
 |-- Ano: string (nullable = true)
 |-- Data_Nascimento: date (nullable = true)



#### Dropping columns

In [36]:
df = df.drop('Nascimento')
df.show(5)

+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+
|  Selecao|Numero|Posicao|         Nome_FIFA|Nome Camiseta|                Time|Altura|Peso|Dia|Mes| Ano|Data_Nascimento|
+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+
|Argentina|     3|     DF|TAGLIAFICO Nicolas|   TAGLIAFICO|      AFC Ajax (NED)|   169|  65| 31| 08|1992|     1992-08-31|
|Argentina|    22|     MF|    PAVON Cristian|        PAVÓN|CA Boca Juniors (...|   169|  65| 21| 01|1996|     1996-01-21|
|Argentina|    15|     MF|    LANZINI Manuel|      LANZINI|West Ham United F...|   167|  66| 15| 02|1993|     1993-02-15|
|Argentina|    18|     DF|    SALVIO Eduardo|       SALVIO|    SL Benfica (POR)|   167|  69| 13| 07|1990|     1990-07-13|
|Argentina|    10|     FW|      MESSI Lionel|        MESSI|  FC Barcelona (ESP)|   170|  72| 24| 06|1987|     1987-06-24|
+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+

#### Creating a backup

In [37]:
# DataFrames are immutable, so a second reference is enough to keep this state around
df2 = df

### Window Ranking Functions

#### Window Function 1 - Row number - row_number()

In [44]:
# Number the rows within each team, ordered by height (tallest first)
num_linha = Window.partitionBy('Selecao').orderBy(desc('Altura'))  # orderBy defaults to ascending
df.withColumn('numero_linha', row_number().over(num_linha)).show(30)

+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+------------+
|  Selecao|Numero|Posicao|         Nome_FIFA|Nome Camiseta|                Time|Altura|Peso|Dia|Mes| Ano|Data_Nascimento|numero_linha|
+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+------------+
|Argentina|     6|     DF|    FAZIO Federico|        FAZIO|       AS Roma (ITA)|   199|  85| 17| 03|1987|     1987-03-17|           1|
|Argentina|     1|     GK|     GUZMAN Nahuel|       GUZMÁN|   Tigres UANL (MEX)|   192|  90| 10| 02|1986|     1986-02-10|           2|
|Argentina|    16|     DF|       ROJO Marcos|         ROJO|Manchester United...|   189|  82| 20| 03|1990|     1990-03-20|           3|
|Argentina|    12|     GK|     ARMANI Franco|       ARMANI|CA River Plate (ARG)|   189|  85| 16| 10|1986|     1986-10-16|           4|
|Argentina|    23|     GK|CABALLERO Wilfredo|    CABALL

#### Window Function 2 - Ranking 1 - rank()

In [47]:
# Players with the same height share the same rank; the next rank is skipped after a tie.
rank1 = Window.partitionBy('Selecao').orderBy(desc('Altura'))
df.withColumn('rank1', rank().over(rank1)).show(50)

+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+-----+
|  Selecao|Numero|Posicao|         Nome_FIFA|Nome Camiseta|                Time|Altura|Peso|Dia|Mes| Ano|Data_Nascimento|rank1|
+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+-----+
|Argentina|     6|     DF|    FAZIO Federico|        FAZIO|       AS Roma (ITA)|   199|  85| 17| 03|1987|     1987-03-17|    1|
|Argentina|     1|     GK|     GUZMAN Nahuel|       GUZMÁN|   Tigres UANL (MEX)|   192|  90| 10| 02|1986|     1986-02-10|    2|
|Argentina|    16|     DF|       ROJO Marcos|         ROJO|Manchester United...|   189|  82| 20| 03|1990|     1990-03-20|    3|
|Argentina|    12|     GK|     ARMANI Franco|       ARMANI|CA River Plate (ARG)|   189|  85| 16| 10|1986|     1986-10-16|    3|
|Argentina|    23|     GK|CABALLERO Wilfredo|    CABALLERO|    Chelsea FC (ENG)|   186|  80| 28| 09|1981

#### Window Function 3 - Ranking 2 - dense_rank()

In [48]:
rank2 = Window.partitionBy('Selecao').orderBy(desc('Altura'))
df.withColumn('rank2', dense_rank().over(rank2)).show(30)

+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+-----+
|  Selecao|Numero|Posicao|         Nome_FIFA|Nome Camiseta|                Time|Altura|Peso|Dia|Mes| Ano|Data_Nascimento|rank2|
+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+-----+
|Argentina|     6|     DF|    FAZIO Federico|        FAZIO|       AS Roma (ITA)|   199|  85| 17| 03|1987|     1987-03-17|    1|
|Argentina|     1|     GK|     GUZMAN Nahuel|       GUZMÁN|   Tigres UANL (MEX)|   192|  90| 10| 02|1986|     1986-02-10|    2|
|Argentina|    16|     DF|       ROJO Marcos|         ROJO|Manchester United...|   189|  82| 20| 03|1990|     1990-03-20|    3|
|Argentina|    12|     GK|     ARMANI Franco|       ARMANI|CA River Plate (ARG)|   189|  85| 16| 10|1986|     1986-10-16|    3|
|Argentina|    23|     GK|CABALLERO Wilfredo|    CABALLERO|    Chelsea FC (ENG)|   186|  80| 28| 09|1981
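The outputs of `row_number()`, `rank()` and `dense_rank()` only differ once a tie appears (ROJO and ARMANI, both 189 cm). The block below is a plain-Python sketch of the three semantics, not Spark's implementation; the helper names `row_numbers`, `ranks` and `dense_ranks` are hypothetical, and the input is just the first five Argentina heights shown above, already sorted in descending order.

```python
# Plain-Python sketch of the ranking semantics (illustrative helpers, not the Spark API).
# Input: the first five Argentina heights from the outputs above, sorted descending.
heights = [199, 192, 189, 189, 186]


def row_numbers(values):
    """Sequential position: ties still get distinct numbers."""
    return list(range(1, len(values) + 1))


def ranks(values):
    """Tied values share a rank; the following rank skips ahead by the tie size."""
    result = []
    for i, value in enumerate(values):
        if i > 0 and value == values[i - 1]:
            result.append(result[-1])   # same value as previous row: same rank
        else:
            result.append(i + 1)        # otherwise the rank equals the row position
    return result


def dense_ranks(values):
    """Tied values share a rank; the following rank is the next integer (no gap)."""
    result = []
    for i, value in enumerate(values):
        if i == 0:
            result.append(1)
        elif value == values[i - 1]:
            result.append(result[-1])
        else:
            result.append(result[-1] + 1)
    return result


print(row_numbers(heights))  # [1, 2, 3, 4, 5]
print(ranks(heights))        # [1, 2, 3, 3, 5]
print(dense_ranks(heights))  # [1, 2, 3, 3, 4]
```

The practical consequence: filtering on `rank = 1` can return several rows per partition when there is a tie for first place, while filtering on `row_number = 1` returns exactly one.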

#### Window Function 4 - Percent rank - percent_rank()

In [50]:
# Relative rank of each row within its team: (rank - 1) / (rows in partition - 1)
porcentagem = Window.partitionBy('Selecao').orderBy(desc('Altura'))
df.withColumn('%', percent_rank().over(porcentagem)).show(30)

+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+--------------------+
|  Selecao|Numero|Posicao|         Nome_FIFA|Nome Camiseta|                Time|Altura|Peso|Dia|Mes| Ano|Data_Nascimento|                   %|
+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+--------------------+
|Argentina|     6|     DF|    FAZIO Federico|        FAZIO|       AS Roma (ITA)|   199|  85| 17| 03|1987|     1987-03-17|                 0.0|
|Argentina|     1|     GK|     GUZMAN Nahuel|       GUZMÁN|   Tigres UANL (MEX)|   192|  90| 10| 02|1986|     1986-02-10|0.045454545454545456|
|Argentina|    16|     DF|       ROJO Marcos|         ROJO|Manchester United...|   189|  82| 20| 03|1990|     1990-03-20| 0.09090909090909091|
|Argentina|    12|     GK|     ARMANI Franco|       ARMANI|CA River Plate (ARG)|   189|  85| 16| 10|1986|     1986-10-16| 0.09090909090909091|
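The values in the `%` column can be reproduced by hand. The sketch below assumes each team has 23 players (736 rows across 32 teams in this dataset) and applies the standard `(rank - 1) / (n - 1)` definition; the function name `percent_rank_value` is hypothetical.

```python
# Plain-Python sketch of the percent_rank formula (illustrative helper, not the Spark API).
def percent_rank_value(rank, partition_size):
    """Relative rank in [0, 1]: (rank - 1) / (partition_size - 1)."""
    if partition_size <= 1:
        return 0.0  # a single-row partition has nothing to be ranked against
    return (rank - 1) / (partition_size - 1)


# Ranks of the four Argentina rows shown above: 1, 2, 3, 3 (ROJO and ARMANI tie).
squad_size = 23  # assumption: 23 players per team
for rank in [1, 2, 3, 3]:
    print(rank, percent_rank_value(rank, squad_size))
```

Because the formula is built on `rank()`, tied rows get the same value, which matches the two identical `0.0909...` entries above.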

#### Window Function 5 - Split into 'N' buckets - ntile()

In [52]:
# Split each team into 5 buckets, following the window order
parte = Window.partitionBy('Selecao').orderBy(desc('Altura'))
df.withColumn('Partes', ntile(5).over(parte)).show(30)

+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+------+
|  Selecao|Numero|Posicao|         Nome_FIFA|Nome Camiseta|                Time|Altura|Peso|Dia|Mes| Ano|Data_Nascimento|Partes|
+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+------+
|Argentina|     6|     DF|    FAZIO Federico|        FAZIO|       AS Roma (ITA)|   199|  85| 17| 03|1987|     1987-03-17|     1|
|Argentina|     1|     GK|     GUZMAN Nahuel|       GUZMÁN|   Tigres UANL (MEX)|   192|  90| 10| 02|1986|     1986-02-10|     1|
|Argentina|    16|     DF|       ROJO Marcos|         ROJO|Manchester United...|   189|  82| 20| 03|1990|     1990-03-20|     1|
|Argentina|    12|     GK|     ARMANI Franco|       ARMANI|CA River Plate (ARG)|   189|  85| 16| 10|1986|     1986-10-16|     1|
|Argentina|    23|     GK|CABALLERO Wilfredo|    CABALLERO|    Chelsea FC (ENG)|   186|  80| 28| 
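When the partition size is not a multiple of the bucket count, the standard NTILE rule gives the extra rows to the first buckets. The following is a plain-Python sketch of that rule under the assumption of 23 rows per team; `ntile_buckets` is a hypothetical helper, not a Spark function.

```python
# Plain-Python sketch of how NTILE assigns bucket numbers (illustrative, not the Spark API).
from collections import Counter


def ntile_buckets(n_rows, n_buckets):
    """Return the bucket number (1-based) for each of n_rows ordered rows."""
    base, extra = divmod(n_rows, n_buckets)
    assignment = []
    for bucket in range(1, n_buckets + 1):
        # The first `extra` buckets receive one additional row.
        size = base + (1 if bucket <= extra else 0)
        assignment.extend([bucket] * size)
    return assignment


buckets = ntile_buckets(23, 5)  # assumption: 23 players per team
print(Counter(buckets))         # bucket sizes: 5, 5, 5, 4, 4
```

This agrees with the output above, where the first rows of Argentina all fall into bucket 1.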

### Window Analytic Functions

#### Window Function 6 - LAG / previous row - lag()

In [59]:
# Takes the weight from the previous row of the window and places it on the current row
degrau = Window.partitionBy('Selecao').orderBy(desc('Altura'))
df.withColumn('degrau', lag('Peso').over(degrau)).show(30)

+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+------+
|  Selecao|Numero|Posicao|         Nome_FIFA|Nome Camiseta|                Time|Altura|Peso|Dia|Mes| Ano|Data_Nascimento|degrau|
+---------+------+-------+------------------+-------------+--------------------+------+----+---+---+----+---------------+------+
|Argentina|     6|     DF|    FAZIO Federico|        FAZIO|       AS Roma (ITA)|   199|  85| 17| 03|1987|     1987-03-17|  null|
|Argentina|     1|     GK|     GUZMAN Nahuel|       GUZMÁN|   Tigres UANL (MEX)|   192|  90| 10| 02|1986|     1986-02-10|    85|
|Argentina|    16|     DF|       ROJO Marcos|         ROJO|Manchester United...|   189|  82| 20| 03|1990|     1990-03-20|    90|
|Argentina|    12|     GK|     ARMANI Franco|       ARMANI|CA River Plate (ARG)|   189|  85| 16| 10|1986|     1986-10-16|    82|
|Argentina|    23|     GK|CABALLERO Wilfredo|    CABALLERO|    Chelsea FC (ENG)|   186|  80| 28| 

#### Window Function 7 - LEAD / next row - lead()

In [None]:
# Takes the height from the next row of the window and places it on the current row
degrau = Window.partitionBy('Selecao').orderBy(desc('Altura'))
df.withColumn('degrau', lead('Altura').over(degrau)).show(30)
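`lag()` and `lead()` are mirror images: one looks backwards in the window order, the other forwards, and both return null when there is no such row in the partition. A plain-Python sketch, using hypothetical helpers `lag_values` and `lead_values` and only the first five Argentina rows from the outputs above (so the last `lead` value here is `None`, whereas in the full partition it would be the sixth player's height):

```python
# Plain-Python sketch of lag/lead semantics (illustrative helpers, not the Spark API).
def lag_values(values, offset=1, default=None):
    """Value from `offset` rows earlier, or `default` when that row does not exist."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead_values(values, offset=1, default=None):
    """Value from `offset` rows later, or `default` when that row does not exist."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]


# First five Argentina rows, ordered by height descending.
weights = [85, 90, 82, 85, 80]
heights = [199, 192, 189, 189, 186]

print(lag_values(weights))   # [None, 85, 90, 82, 85]
print(lead_values(heights))  # [192, 189, 189, 186, None]
```

A common use is a row-to-row difference, for example the current height minus the `lead` height, to see how much taller each player is than the next one in the ordering.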

#### Aggregations

#### GroupBy + AGG 1

In [78]:
# df.groupBy('Selecao').mean('Altura').show()
df.groupBy('Selecao').agg({'Altura': 'avg'}).orderBy(desc('avg(Altura)')).show()

+--------------+------------------+
|       Selecao|       avg(Altura)|
+--------------+------------------+
|        Serbia|186.69565217391303|
|       Denmark| 186.6086956521739|
|       Germany| 185.7826086956522|
|        Sweden| 185.7391304347826|
|       Iceland|185.52173913043478|
|       Belgium|185.34782608695653|
|       Croatia| 185.2608695652174|
|       Nigeria|184.52173913043478|
|       IR Iran|184.47826086956522|
|        Russia| 184.3913043478261|
|       Senegal|183.65217391304347|
|        France|183.30434782608697|
|        Poland|183.17391304347825|
|       Tunisia|183.08695652173913|
|   Switzerland|182.91304347826087|
|       England| 182.7391304347826|
|       Morocco|182.69565217391303|
|        Panama|182.17391304347825|
|Korea Republic| 181.8695652173913|
|       Uruguay|181.04347826086956|
+--------------+------------------+
only showing top 20 rows



#### GroupBy + AGG 2

In [79]:
df.groupby('Selecao').agg(avg('Altura')).orderBy('avg(Altura)').show()

+--------------+------------------+
|       Selecao|       avg(Altura)|
+--------------+------------------+
|          Peru| 177.6086956521739|
|  Saudi Arabia|177.65217391304347|
|     Argentina|178.43478260869566|
|         Japan| 178.7826086956522|
|      Portugal| 179.7391304347826|
|        Mexico| 179.7826086956522|
|         Spain|179.91304347826087|
|    Costa Rica|180.69565217391303|
|        Brazil| 180.7826086956522|
|      Colombia| 180.7826086956522|
|     Australia| 180.8695652173913|
|         Egypt|             181.0|
|       Uruguay|181.04347826086956|
|Korea Republic| 181.8695652173913|
|        Panama|182.17391304347825|
|       Morocco|182.69565217391303|
|       England| 182.7391304347826|
|   Switzerland|182.91304347826087|
|       Tunisia|183.08695652173913|
|        Poland|183.17391304347825|
+--------------+------------------+
only showing top 20 rows



#### Where

In [80]:
# where() is an alias for filter()
df.where('Selecao = "Brazil"').show(5)

+-------+------+-------+-----------+-------------+--------------------+------+----+---+---+----+---------------+
|Selecao|Numero|Posicao|  Nome_FIFA|Nome Camiseta|                Time|Altura|Peso|Dia|Mes| Ano|Data_Nascimento|
+-------+------+-------+-----------+-------------+--------------------+------+----+---+---+----+---------------+
| Brazil|    18|     MF|       FRED|         FRED|FC Shakhtar Donet...|   169|  64| 05| 03|1993|     1993-03-05|
| Brazil|    21|     FW|     TAISON|       TAISON|FC Shakhtar Donet...|   172|  64| 13| 01|1988|     1988-01-13|
| Brazil|    17|     MF|FERNANDINHO|  FERNANDINHO|Manchester City F...|   179|  67| 04| 05|1985|     1985-05-04|
| Brazil|    22|     DF|     FAGNER|       FAGNER|SC Corinthians (BRA)|   168|  67| 11| 06|1989|     1989-06-11|
| Brazil|    10|     FW|     NEYMAR|    NEYMAR JR|Paris Saint-Germa...|   175|  68| 05| 02|1992|     1992-02-05|
+-------+------+-------+-----------+-------------+--------------------+------+----+---+---+----+---------------+

In [82]:
# Keep the tallest player of each team; rank() keeps every player tied for first place
top1 = Window.partitionBy('Selecao').orderBy(desc('Altura'))
df.withColumn('Top', rank().over(top1)).filter('Top = 1').show(5)

+---------+------+-------+----------------+-------------+--------------------+------+----+---+---+----+---------------+---+
|  Selecao|Numero|Posicao|       Nome_FIFA|Nome Camiseta|                Time|Altura|Peso|Dia|Mes| Ano|Data_Nascimento|Top|
+---------+------+-------+----------------+-------------+--------------------+------+----+---+---+----+---------------+---+
|Argentina|     6|     DF|  FAZIO Federico|        FAZIO|       AS Roma (ITA)|   199|  85| 17| 03|1987|     1987-03-17|  1|
|Australia|    12|     GK|      JONES Brad|        JONES|Feyenoord Rotterd...|   193|  87| 19| 03|1982|     1982-03-19|  1|
|  Belgium|     1|     GK|COURTOIS Thibaut|     COURTOIS|    Chelsea FC (ENG)|   199|  91| 11| 05|1992|     1992-05-11|  1|
|   Brazil|    16|     GK|          CASSIO|       CASSIO|SC Corinthians (BRA)|   195|  92| 06| 06|1987|     1987-06-06|  1|
| Colombia|    13|     DF|      MINA Yerry|      Y. MINA|  FC Barcelona (ESP)|   194|  95| 23| 09|1994|     1994-09-23|  1|
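+---------+------+-------+----------------+-------------+--------------------+------+----+---+---+----+---------------+---+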

#### Describe

In [84]:
df.describe().show()
# df.where("Selecao = 'Brazil'").describe().show()

+-------+---------+-----------------+-------+------------+-------------+--------------------+-----------------+-----------------+------------------+------------------+------------------+
|summary|  Selecao|           Numero|Posicao|   Nome_FIFA|Nome Camiseta|                Time|           Altura|             Peso|               Dia|               Mes|               Ano|
+-------+---------+-----------------+-------+------------+-------------+--------------------+-----------------+-----------------+------------------+------------------+------------------+
|  count|      736|              736|    736|         736|          736|                 736|              736|              736|               736|               736|               736|
|   mean|     null|             12.0|   null|        null|         null|                null|182.4076086956522|77.18885869565217|15.793478260869565|5.8790760869565215| 1990.110054347826|
| stddev|     null|6.637760461599851|   null|        null|       

#### Window Function 8 - Aggregate functions over a window

In [86]:
parametro = Window.partitionBy('Selecao').orderBy(desc('Altura'))
parametro2 = Window.partitionBy('Selecao')

df.withColumn('linhax', row_number().over(parametro))\
.withColumn('media', avg('Altura').over(parametro2))\
.withColumn('max', max('Altura').over(parametro2))\
.withColumn('min', min('Altura').over(parametro2))\
.filter('linhax = 1').select('Selecao', 'media', 'max', 'min')\
.orderBy('media', ascending=False).show(30)

+--------------+------------------+---+---+
|       Selecao|             media|max|min|
+--------------+------------------+---+---+
|        Serbia|186.69565217391303|195|169|
|       Denmark| 186.6086956521739|200|171|
|       Germany| 185.7826086956522|195|176|
|        Sweden| 185.7391304347826|198|177|
|       Iceland|185.52173913043478|198|170|
|       Belgium|185.34782608695653|199|169|
|       Croatia| 185.2608695652174|201|172|
|       Nigeria|184.52173913043478|197|172|
|       IR Iran|184.47826086956522|194|177|
|        Russia| 184.3913043478261|196|173|
|       Senegal|183.65217391304347|196|173|
|        France|183.30434782608697|197|168|
|        Poland|183.17391304347825|195|172|
|       Tunisia|183.08695652173913|192|170|
|   Switzerland|182.91304347826087|192|165|
|       England| 182.7391304347826|196|170|
|       Morocco|182.69565217391303|190|167|
|        Panama|182.17391304347825|197|165|
|Korea Republic| 181.8695652173913|197|170|
|       Uruguay|181.043478260869
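The difference from `groupBy` is that a window aggregate does not collapse rows: every player row keeps its own columns and gains the team-level `media`, `max` and `min`. That is why the cell above needs `row_number()` plus a filter to get back to one row per team. A plain-Python sketch of the same idea, on a tiny made-up dataset (the team names and heights below are hypothetical, not taken from the CSV):

```python
# Plain-Python sketch of "aggregate over a window, then keep one row per partition".
from collections import defaultdict

# Hypothetical (team, height) rows, for illustration only.
rows = [("A", 190), ("A", 180), ("B", 170), ("B", 175), ("B", 180)]

# Step 1: compute the per-team statistics (what avg/max/min compute per partition).
heights_by_team = defaultdict(list)
for team, height in rows:
    heights_by_team[team].append(height)
stats = {
    team: (sum(hs) / len(hs), max(hs), min(hs))
    for team, hs in heights_by_team.items()
}

# Step 2: window-style result, every original row is kept and gains the statistics.
windowed = [(team, height) + stats[team] for team, height in rows]

# Step 3: keep only the first row of each team (the row_number = 1 filter).
seen = set()
summary = []
for team, _height, mean, tallest, shortest in windowed:
    if team not in seen:
        seen.add(team)
        summary.append((team, mean, tallest, shortest))

print(len(windowed))  # 5, same number of rows as the input
print(summary)        # [('A', 185.0, 190, 180), ('B', 175.0, 180, 170)]
```

When only the per-team summary is needed, a `groupBy('Selecao').agg(...)` with the same three aggregates yields the same table more directly; the window form pays off when the aggregates have to sit next to the individual rows.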