# Chapter 10 Your data under a different lens: Window functions
In this chapter, we will discuss:
- Window functions and the kinds of data transformations they enable
- Summarizing, ranking, and analyzing data using the different classes of window functions
- Building static, growing, and unbounded windows for your functions
- Applying UDFs to windows as custom window functions

In [None]:
import sys
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException
import pyspark.sql.functions as F
import pyspark.sql.types as T
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pyspark.sql.window import Window

import warnings
warnings.filterwarnings('ignore')

# change the account name to your email account
account='sli'

# define a root path to access the data in the DataAnalysisWithPythonAndPySpark
data_path='/net/clusterhn/home/'+account+'/isa460/data/'

# check whether a Spark session is already active; if it is, stop it

try:
    if spark:
        spark.stop()
except NameError:
    # no session has been created yet in this notebook
    pass

spark = (SparkSession.builder.appName("Multidimensional Data Frame")
         .config("spark.port.maxRetries", "100")
         .config("spark.sql.mapKeyDedupPolicy", "LAST_WIN")  # allow duplicate keys in the map data type (the last value wins)
         .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
         .config("spark.driver.memory", "8g")
         .config("spark.executor.memory", "8g")
         .getOrCreate())

# configure the log level (default is WARN)
spark.sparkContext.setLogLevel('ERROR')

# Weather data
For this exercise, we will use the [National Oceanic and Atmospheric Administration’s (NOAA) Global Surface Summary of the Day (GSOD) data set](https://catalog.data.gov/dataset/global-surface-summary-of-the-day-gsod1). I have downloaded the daily Boston weather data (Boston Logan weather station 725090) from Google BigQuery. We will focus on weather data in Boston between 2010 and 2024.

In [None]:
# load data
df=spark.read.csv(data_path+'boston_weather', header=True, inferSchema=True)

df.printSchema()

In [None]:
df.count()

In [None]:
df.show()

In [None]:
# create a date column from year, month, and day, plus a year_month column
# (parsing with 'yyyy-MM' yields the first day of each month)

df=(df.withColumn('date', F.to_date(F.concat_ws('-', df['year'], df['mo'], df['da']), 'yyyy-MM-dd'))
      .withColumn('year_month', F.to_date(F.concat_ws('-', df['year'], df['mo']), 'yyyy-MM'))
      .orderBy('date'))

df.limit(10).show()

## Visualize daily temperature

## Identify the coldest day of each year

In [None]:
from pyspark.sql.window import Window

windowSpec=Window.partitionBy('year')

result=df.withColumn('coldestDay', F.min('temp').over(windowSpec))

result.where('temp=coldestDay').show()

#result.where('temp=coldestDay').drop('coldestDay').show()

## Identify the hottest day of each year

## Ranking functions

This section covers the ranking functions:
- nonconsecutive ranks with rank()
- consecutive ranks with dense_rank()
- percentile ranks with percent_rank()
- tiles with ntile()
- a bare row number with row_number()

Ranking functions are used for getting the top (or bottom) record for each window partition, or, more generally, for getting an order according to some column’s value.

### Identify the top 3 hottest days per year

In [None]:
windowSpec=Window.partitionBy('year').orderBy(F.desc('temp'))



### Identify the top 5% hottest days per year

Use percent_rank().

### Split the temp per year into 10 equal buckets (deciles)

Use ntile().

In [None]:
# check the temp in decile 1 (the coldest 10% of temperatures in each year)



### Add a row number to your data frame, ignoring ties

Use row_number().

In [None]:
windowSpec=Window.partitionBy('year').orderBy('temp')

df.withColumn('row_number', F.row_number().over(windowSpec)).show()

## Access the records before or after using lag() and lead()

### Display average daily temp change by month

#### Visualize the result

#### Display avg, max and min temp change by month

#### Visualize the result

## Spark also provides the rowsBetween() and rangeBetween() methods to create window frame boundaries.

### Display the three-day moving average temp for each month

In [None]:
windowSpec=Window.partitionBy('year', 'mo').orderBy('da').rowsBetween(-2,0)

df1=df.withColumn('3_day_moving_avg', F.avg('temp').over(windowSpec))

df1.show()

# Summary

- Window functions are functions that are applied over a portion of a data frame called a window frame. They can perform aggregation, ranking, or analytical operations. A window function will return the data frame with the same number of records, unlike its siblings the groupby-aggregate operation and the group map UDF.
- A window frame is defined through a window spec. A window spec mandates how the data frame is split (partitionBy()), how it’s ordered (orderBy()), and how it’s portioned (rowsBetween()/rangeBetween()).
- By default, an unordered window frame will be unbounded, meaning that the window frame will be equal to the window partition for every record. An ordered window frame will grow to the left, meaning that each record will have a window frame ranging from the first record in the window partition to the current record.
- A window can be bounded by row, meaning that the records included in the window frame are determined by row offsets relative to the current row (the boundaries are added to the position of the current row), or by range, meaning that the records included in the window frame depend on the ordering value of the current row (the boundaries are added to that value).