### *Window Functions*
A PySpark window function computes a value such as a rank or a row number over a group (frame) of rows and returns a result for each row 
individually. Window functions are also increasingly used for data transformations. This section covers the concept of window functions, their syntax, and how to use them with PySpark 
SQL and the PySpark DataFrame API. 

### *Ranking Function*
A ranking function returns the rank of each row within its partition (group). It numbers the rows in a result column according to the order given by 
orderBy() in the window specification, separately for each partition defined by Window.partitionBy() (the PARTITION BY part of the OVER clause in SQL).
E.g. row_number(), rank(), dense_rank(), etc.

In [0]:
# Sample rows and column names for the DataFrame

data = [
    ("James", "Developer", 3000),
    ("Michael", "Testing", 2000),
    ("Tony", "Developer", 4000),
    ("Parker", "Support", 3000),
    ("Stephen", "Developer", 4000),
    ("Clark", "Support", 3500),
    ("Bruce", "Testing", 3000),
    ("Allen", "Developer", 3500),
    ("Loki", "Support", 3000),
    ("Buttowski", "Developer", 5000),
]

schema = ["Name", "Role", "Salary"]

In [0]:
df = spark.createDataFrame(data, schema)

In [0]:
df.display()

Name,Role,Salary
James,Developer,3000
Michael,Testing,2000
Tony,Developer,4000
Parker,Support,3000
Stephen,Developer,4000
Clark,Support,3500
Bruce,Testing,3000
Allen,Developer,3500
Loki,Support,3000
Buttowski,Developer,5000


In [0]:
from pyspark.sql.window import Window

#### *To apply a window function to a group of rows, first define the partition, i.e. the group of rows, with Window.partitionBy(). The row number and rank functions additionally require an ordering within each partition, specified with orderBy() (the ORDER BY clause in SQL).* 

In [0]:
window = Window.partitionBy("Role").orderBy("Salary")

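The same partitioning and ordering can also be written in Spark SQL with an OVER clause. The following is a minimal sketch, not the notebook's own method: the view name `employees`, the helper `rank_with_sql`, and the column aliases are hypothetical names chosen for illustration, and the helper assumes an active `spark` session and the `df` created above.

```python
# Spark SQL sketch of the window defined with Window.partitionBy("Role").orderBy("Salary").
# The view name "employees", the aliases, and the helper name are hypothetical.
RANKING_QUERY = """
SELECT
    Name,
    Role,
    Salary,
    ROW_NUMBER() OVER (PARTITION BY Role ORDER BY Salary) AS row_num,
    RANK()       OVER (PARTITION BY Role ORDER BY Salary) AS salary_rank,
    DENSE_RANK() OVER (PARTITION BY Role ORDER BY Salary) AS salary_dense_rank
FROM employees
"""


def rank_with_sql(spark, df):
    """Register df as a temporary view and run the ranking query against it."""
    df.createOrReplaceTempView("employees")
    return spark.sql(RANKING_QUERY)
```

In a notebook with a running session, `rank_with_sql(spark, df).display()` should show the three ranking columns side by side.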
In [0]:
from pyspark.sql.functions import row_number, rank, dense_rank, percent_rank, cume_dist

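#### *row_number()*
The row_number() function assigns a sequential number, starting at 1, to each row within its window partition. Rows with equal Salary still receive distinct numbers, and the order among such ties is not guaranteed. 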

In [0]:
# row_number()
df.withColumn("Row_number", row_number().over(window)).display()

Name,Role,Salary,Row_number
James,Developer,3000,1
Allen,Developer,3500,2
Tony,Developer,4000,3
Stephen,Developer,4000,4
Buttowski,Developer,5000,5
Parker,Support,3000,1
Loki,Support,3000,2
Clark,Support,3500,3
Michael,Testing,2000,1
Bruce,Testing,3000,2


#### *rank()*
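The rank() function assigns a rank to each row within its window partition. Tied rows share the same rank, and a gap follows the tie: in the output below, the two Developer rows with a salary of 4000 both get rank 3 and the next row gets rank 5.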

In [0]:
# rank() 
df.withColumn("rank", rank().over(window)).display()

Name,Role,Salary,rank
James,Developer,3000,1
Allen,Developer,3500,2
Tony,Developer,4000,3
Stephen,Developer,4000,3
Buttowski,Developer,5000,5
Parker,Support,3000,1
Loki,Support,3000,1
Clark,Support,3500,3
Michael,Testing,2000,1
Bruce,Testing,3000,2


#### *dense_rank()*
The dense_rank() function also ranks each row within its window partition. It differs from rank() in one way only: rank() leaves gaps after ties, whereas dense_rank() does not, so the ranks stay consecutive even when values repeat.

In [0]:
# dense_rank() 
df.withColumn("dense_rank", dense_rank().over(window)).display()

Name,Role,Salary,dense_rank
James,Developer,3000,1
Allen,Developer,3500,2
Tony,Developer,4000,3
Stephen,Developer,4000,3
Buttowski,Developer,5000,4
Parker,Support,3000,1
Loki,Support,3000,1
Clark,Support,3500,2
Michael,Testing,2000,1
Bruce,Testing,3000,2

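The difference between the three numbering schemes can be reproduced without Spark. The sketch below is plain Python for a single partition that is already sorted by Salary; the helper `number_rows` is a hypothetical name, not a PySpark API, and it only illustrates the numbering rules rather than how Spark evaluates them.

```python
# Pure-Python sketch of row_number, rank and dense_rank for one sorted partition.
# `number_rows` is a hypothetical helper, not part of PySpark.

def number_rows(sorted_values):
    """Return (row_numbers, ranks, dense_ranks) for values sorted ascending."""
    row_numbers, ranks, dense_ranks = [], [], []
    for position, value in enumerate(sorted_values, start=1):
        # A row is a tie when it repeats the value of the previous row.
        is_tie = position > 1 and value == sorted_values[position - 2]
        # row_number: always increases by one, ties included.
        row_numbers.append(position)
        # rank: ties reuse the previous rank, otherwise the position is used,
        # which is what produces the gap after a tie.
        ranks.append(ranks[-1] if is_tie else position)
        # dense_rank: ties reuse the previous rank, otherwise previous + 1.
        if is_tie:
            dense_ranks.append(dense_ranks[-1])
        else:
            dense_ranks.append(dense_ranks[-1] + 1 if dense_ranks else 1)
    return row_numbers, ranks, dense_ranks


developer_salaries = [3000, 3500, 4000, 4000, 5000]
row_numbers, ranks, dense_ranks = number_rows(developer_salaries)
print(row_numbers)  # [1, 2, 3, 4, 5]
print(ranks)        # [1, 2, 3, 3, 5]
print(dense_ranks)  # [1, 2, 3, 3, 4]
```

These values match the Developer rows in the three output tables above.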

In [0]:
# percent_rank() calculates the relative rank of each row in a window partition as a value between 0 and 1: (rank - 1) / (rows in partition - 1).

df.withColumn("percent_rank", percent_rank().over(window)).display()

Name,Role,Salary,percent_rank
James,Developer,3000,0.0
Allen,Developer,3500,0.25
Tony,Developer,4000,0.5
Stephen,Developer,4000,0.5
Buttowski,Developer,5000,1.0
Parker,Support,3000,0.0
Loki,Support,3000,0.0
Clark,Support,3500,1.0
Michael,Testing,2000,0.0
Bruce,Testing,3000,1.0

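The percent_rank values above follow from the rank() values: each one is (rank - 1) divided by (number of rows in the partition - 1). A minimal plain-Python sketch of that arithmetic, assuming one partition already sorted by Salary; `percent_rank_values` is a hypothetical helper name, not a PySpark function.

```python
# Pure-Python sketch of the percent_rank arithmetic for one sorted partition.
# `percent_rank_values` is a hypothetical helper, not part of PySpark.

def percent_rank_values(sorted_values):
    """Return (rank - 1) / (n - 1) for each value in an ascending list."""
    n = len(sorted_values)
    results = []
    for value in sorted_values:
        # The 1-based index of the first occurrence equals rank() with gaps.
        rank = sorted_values.index(value) + 1
        # A single-row partition has no spread, so its value is 0.0.
        results.append(0.0 if n == 1 else (rank - 1) / (n - 1))
    return results


print(percent_rank_values([3000, 3500, 4000, 4000, 5000]))  # [0.0, 0.25, 0.5, 0.5, 1.0]
print(percent_rank_values([3000, 3000, 3500]))              # [0.0, 0.0, 1.0]
```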

In [0]:
# cume_dist() returns the cumulative distribution of values within a window partition: the fraction of rows whose Salary is less than or equal to the current row's Salary.

df.withColumn("cume_dist", cume_dist().over(window)).display()

Name,Role,Salary,cume_dist
James,Developer,3000,0.2
Allen,Developer,3500,0.4
Tony,Developer,4000,0.8
Stephen,Developer,4000,0.8
Buttowski,Developer,5000,1.0
Parker,Support,3000,0.6666666666666666
Loki,Support,3000,0.6666666666666666
Clark,Support,3500,1.0
Michael,Testing,2000,0.5
Bruce,Testing,3000,1.0

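The cume_dist values above can be checked by hand: for each row, count the rows in the partition whose Salary is less than or equal to that row's Salary, then divide by the partition size. A minimal plain-Python sketch under the assumption that the partition is already sorted by Salary; `cume_dist_values` is a hypothetical helper name, not a PySpark function.

```python
# Pure-Python sketch of the cume_dist arithmetic for one sorted partition.
# `cume_dist_values` is a hypothetical helper, not part of PySpark.
from bisect import bisect_right


def cume_dist_values(sorted_values):
    """Return (rows with value <= current) / (total rows) for an ascending list."""
    n = len(sorted_values)
    # bisect_right gives the number of elements <= value in a sorted list.
    return [bisect_right(sorted_values, value) / n for value in sorted_values]


print(cume_dist_values([3000, 3500, 4000, 4000, 5000]))  # [0.2, 0.4, 0.8, 0.8, 1.0]
print(cume_dist_values([2000, 3000]))                    # [0.5, 1.0]
```

Tied rows get the same value because each of them counts all rows in the tie, which is why both 4000 salaries in the Developer partition show 0.8.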