# 1) Distribution Window Functions

The last lesson of the SQL window functions course delves into **distribution or statistical window functions**, which provide insights into data distribution and are used chiefly for statistical analysis.

The distribution window functions have diverse applications across various domains, including marketing, finance, healthcare, sports, and education. They allow data analysts to gain valuable insights from data and make data-driven decisions. For example, these functions can help marketers identify the top-performing products, campaigns, or customers based on their sales or engagement metrics, allowing them to optimize their marketing strategy. Also, they can be used for calculating the percentiles of medical data such as blood pressure, BMI, or cholesterol levels, helping physicians or researchers diagnose diseases or assess the health status of patients.

We'll cover two types of distribution window functions:

1. **Rank distribution** functions are used to calculate the relative rank of a specific row in a window partition based on the ordering specified in the `OVER()` clause.

    * `PERCENT_RANK()`

    * `CUME_DIST()`

2. **Inverse distribution** functions are used to calculate the value at a specified percentile in a group based on the ordering specified in the `WITHIN GROUP` clause.

    * `PERCENTILE_CONT()`

    * `PERCENTILE_DISC()`

# 2) The Rank Distribution Functions - CUME_DIST()

As mentioned, two variants of rank distribution functions exist: `CUME_DIST()` and `PERCENT_RANK()`. They compute a row's relative rank in the window partition and return a ratio between 0 and 1.

The first rank distribution function that we'll discuss is `CUME_DIST()`.


**The CUME_DIST() Function**

The `CUME_DIST()` function, which takes no argument, calculates the cumulative distribution of values in a column specified in the `ORDER BY` subclause. In other words, assuming the data is sorted in **ascending order**, the cumulative distribution of a value in a particular row is determined as the number of rows with values less than or equal to that value divided by the number of rows existing in the window partition.

The basic syntax of the `CUME_DIST()` function is as follows:

```sql
CUME_DIST() OVER ( [PARTITION BY expression, ... ]
                   ORDER BY expression [ASC | DESC], ... )
```

`PARTITION BY`: an optional subclause that divides the result set into partitions based on the values of one or more columns or expressions.

`ORDER BY`: a mandatory subclause that specifies the column or expression used to sort the rows within each partition.

Consider the `trips` table containing twelve bike trips as follows:

| start_date                 | end_date                   | duration | start_station_n | start_station     | end_station_n | end_station                              | bike_number | member | rider_rt |
|----------------------------|----------------------------|----------|-----------------|-------------------|---------------|------------------------------------------|-------------|--------|----------|
| 2017-10-01 03:08:00.000000 | 2017-10-01 03:29:00.000000 | 1253     | 31002           | 20th & Crystal Dr | 31041         | Prince St & Union St                     | W23272 | Member | 3        |
| 2017-10-01 05:01:00.000000 | 2017-10-01 05:26:00.000000 | 1476     | 31002           | 20th & Crystal Dr | 31010         | S Glebe & Potomac Ave                    | W23254 | Member | 2        |
| 2017-10-01 05:01:00.000000 | 2017-10-01 05:12:00.000000 | 650      | 31002           | 20th & Crystal Dr | 31011         | 23rd & Crystal Dr                        | W00143 | Member | 3        |
| 2017-10-02 03:30:00.000000 | 2017-10-02 03:47:00.000000 | 987      | 31002           | 20th & Crystal Dr | 31249         | Jefferson Memorial                       | W21096 | Member | 4        |
| 2017-10-03 09:34:00.000000 | 2017-10-03 09:47:00.000000 | 797      | 31002           | 20th & Crystal Dr | 31518         | New York Ave & Hecht Ave NE              | W20095 | Member | 4        |
| 2017-10-03 12:00:00.000000 | 2017-10-03 12:23:00.000000 | 1390     | 31002           | 20th & Crystal Dr | 31247         | Jefferson Dr & 14th St SW                | W22965 | Casual | 5        |
| 2017-10-04 04:58:00.000000 | 2017-10-04 05:11:00.000000 | 797      | 31002           | 20th & Crystal Dr | 31503         | Florida Ave & R St NW                    | W23052 | Casual | 5        |
| 2017-10-04 05:21:00.000000 | 2017-10-04 05:36:00.000000 | 918      | 31002           | 20th & Crystal Dr | 31506         | 1st & Rhode Island Ave NW                | W22051 | Casual | 3        |
| 2017-10-04 06:07:00.000000 | 2017-10-04 06:22:00.000000 | 918      | 31002           | 20th & Crystal Dr | 31126         | 11th & Girard St NW                      | W23268 | Member | 4        |
| 2017-10-04 08:30:00.000000 | 2017-10-04 08:45:00.000000 | 918      | 31002           | 20th & Crystal Dr | 31235         | 19th St & Constitution Ave NW            | W22517 | Casual | 5        |
| 2017-10-05 08:08:00.000000 | 2017-10-05 08:11:00.000000 | 202      | 31002           | 20th & Crystal Dr | 31009         | 27th & Crystal Dr                        | W20184 | Member | 3        |
| 2017-10-05 08:08:00.000000 | 2017-10-05 08:33:00.000000 | 1482     | 31002           | 20th & Crystal Dr | 31633         | Independence Ave & L'Enfant Plaza SW/DOE | W00895 | Casual | 4        |

Now, let's try the following query to see the cumulative distribution of the `duration` column:

```sql
SELECT start_date, bike_number, duration,
       ROUND(CUME_DIST() OVER(ORDER BY duration)::numeric, 2) AS cume_dist
  FROM trips;
```

| start_date                 | bike_number | duration | cume_dist |
|----------------------------|-------------|----------|-----------|
| 2017-10-05 08:08:00.000000 | W20184      | 202      | 0.08      |
| 2017-10-01 05:01:00.000000 | W00143      | 650      | 0.17      |
| 2017-10-04 04:58:00.000000 | W23052      | 797      | 0.33      |
| 2017-10-03 09:34:00.000000 | W20095      | 797      | 0.33      |
| 2017-10-04 05:21:00.000000 | W22051      | 918      | 0.58      |
| 2017-10-04 06:07:00.000000 | W23268      | 918      | 0.58      |
| 2017-10-04 08:30:00.000000 | W22517      | 918      | 0.58      |
| 2017-10-02 03:30:00.000000 | W21096      | 987      | 0.67      |
| 2017-10-01 03:08:00.000000 | W23272      | 1253     | 0.75      |
| 2017-10-03 12:00:00.000000 | W22965      | 1390     | 0.83      |
| 2017-10-01 05:01:00.000000 | W23254      | 1476     | 0.92      |
| 2017-10-05 08:08:00.000000 | W00895      | 1482     | 1         |

The `ROUND()` function rounds the cumulative distribution value to two decimal places.

It's worth noting that the cast to `numeric` is required here: in PostgreSQL, `CUME_DIST()` returns `double precision`, and the two-argument form of `ROUND()` accepts only `numeric` input.
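
To see the effect of the cast, the following sketch runs on three of the duration values from the table above, supplied inline through a `VALUES` list so that it doesn't depend on any table. It assumes PostgreSQL; the `sample_trips` name is made up for this illustration, and `pg_typeof()` is a PostgreSQL function that reports the data type of an expression.

```sql
-- Three duration values from the trips table, supplied inline.
WITH sample_trips(duration) AS (
    VALUES (202), (650), (797)
)
SELECT duration,
       -- CUME_DIST() returns double precision in PostgreSQL.
       pg_typeof(CUME_DIST() OVER (ORDER BY duration)) AS result_type,
       -- The cast to numeric lets the two-argument ROUND() accept the value.
       ROUND(CUME_DIST() OVER (ORDER BY duration)::numeric, 2) AS rounded
  FROM sample_trips
 ORDER BY duration;
```

Removing `::numeric` from the last expression makes PostgreSQL reject the query, because no `round(double precision, integer)` function exists.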

We may ask why the value returned by `CUME_DIST()` for the **first row is not zero**. The reason is that every row counts itself: the value in the first row represents the proportion of rows in the window partition whose values are **less than or equal to the first value**, and that set always contains at least the first row.

The above query calculates the cumulative distribution of each `duration` value relative to all values in the `duration` column, with the window sorted by `duration` in ascending order.

The first row shows that the duration value of 202 has a cumulative distribution of `0.08`. This means that approximately **8% of the duration values in the table are less than or equal to 202**. Similarly, the cumulative distribution of the second row's duration is 0.17, which means approximately 17% of the duration values are less than or equal to 650.

The analysis becomes more interesting when we come to the third and fourth rows, which share the duration value of 797. Tied rows are peers, so both receive the same cumulative distribution of 0.33: four of the twelve rows, or approximately 33% of the duration values in the table, are less than or equal to 797.
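
The definition can be checked by hand. The sketch below, which assumes PostgreSQL, feeds the twelve duration values from the table above through an inline `VALUES` list and places `CUME_DIST()` next to a manual calculation: the number of rows less than or equal to the current value, divided by the total number of rows. The names `durations` and `manual_cume_dist` are made up for this illustration.

```sql
-- The twelve duration values from the trips table, supplied inline.
WITH durations(duration) AS (
    VALUES (1253), (1476), (650), (987), (797), (1390),
           (797), (918), (918), (918), (202), (1482)
)
SELECT d.duration,
       CUME_DIST() OVER (ORDER BY d.duration) AS cume_dist,
       -- Rows with a value <= the current value, divided by all rows.
       (SELECT COUNT(*) FROM durations AS o WHERE o.duration <= d.duration)::numeric
           / (SELECT COUNT(*) FROM durations) AS manual_cume_dist
  FROM durations AS d
 ORDER BY d.duration;
```

Both columns should agree up to rounding on every row, including the tied rows for 797 and 918, because the manual count includes all rows that share the current value.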

## Instructions

The bike-sharing company wants to promote safe riding habits by encouraging users to prioritize safety while still enjoying bike riding. To do so, they plan to identify the fastest riders for each day and send them a notification to remind them that safety comes first.

For this exercise, we assume that for each day, if the cumulative distribution of a trip duration falls in the top 5% (i.e., **has a cumulative distribution greater than or equal to 0.95**), the trip is a fast trip, and the trip's rider is considered a fast rider.

Write a query against the `tbl_bikeshare` table to select the fast riders for each day.

1. Create a CTE called "fast_rides" that performs the following steps:
    
    * Select the `start_date`, `duration`, and `bike_number` columns.

    * Use the `EXTRACT` function to pull the day from the `start_date` column.

    * Calculate the cumulative distribution of trip durations for each day using the `CUME_DIST()` window function, partitioned by day and ordered by duration in descending order.

    * Use a `CASE` statement to assign the label '**Fast Rider**' to the riders whose trips' durations fall in the top 5% (i.e., have a cumulative distribution greater than or equal to 0.95). Alias this new column as `fast_riders`.

1. In the main query, select all columns from the "fast_rides" CTE.

1. Filter the main query results to show only the rows labeled 'Fast Rider' in the `fast_riders` column.

1. Order the final result set by `start_date` and `duration`.

```sql
WITH fast_rides AS (
    SELECT start_date,
           EXTRACT(DAY FROM start_date) AS day,
           duration,
           bike_number,
           CUME_DIST() OVER (PARTITION BY EXTRACT(DAY FROM start_date)
                             ORDER BY duration DESC) AS cume_dist,
           CASE
               WHEN CUME_DIST() OVER (PARTITION BY EXTRACT(DAY FROM start_date)
                                      ORDER BY duration DESC) >= 0.95 THEN 'Fast Rider'
               ELSE ''
           END AS fast_riders
      FROM tbl_bikeshare
)
SELECT *
  FROM fast_rides
 WHERE fast_riders = 'Fast Rider'
 ORDER BY start_date, duration;
```

# 3) The Rank Distribution Functions - PERCENT_RANK()

**The PERCENT_RANK() Function**

Although the `PERCENT_RANK()` function is similar to the `CUME_DIST()` function, it's calculated differently and generally returns different values.

Let's first look at the basic syntax of the function and then discuss it in detail.

```sql
PERCENT_RANK() OVER ( [PARTITION BY expression, ... ]
                      ORDER BY expression [ASC | DESC], ... )
```

`PARTITION BY`: an optional subclause that divides the result set into partitions based on the values of one or more columns or expressions.

`ORDER BY`: a mandatory subclause that specifies the column or expression used to sort the rows within each partition.

The `PERCENT_RANK()` function is defined as the rank of a value minus one, divided by the number of rows in the window partition minus one. This enables performing operations such as **dividing data into quartiles** for analytical purposes.

$\text{Percent Rank} = \frac{\text{Rank} - 1}{\text{Number of Rows} - 1}$

Now, let's try the following query against the trips table:

```sql
SELECT start_date, bike_number, duration,
       ROUND(PERCENT_RANK() OVER(ORDER BY duration ASC)::NUMERIC,2) AS percent_rank
  FROM trips;
```

| start_date                 | bike_number | duration | percent_rank |
|----------------------------|-------------|----------|--------------|
| 2017-10-05 08:08:00.000000 | W20184      | 202      | 0            |
| 2017-10-01 05:01:00.000000 | W00143      | 650      | 0.09         |
| 2017-10-04 04:58:00.000000 | W23052      | 797      | 0.18         |
| 2017-10-03 09:34:00.000000 | W20095      | 797      | 0.18         |
| 2017-10-04 05:21:00.000000 | W22051      | 918      | 0.36         |
| 2017-10-04 06:07:00.000000 | W23268      | 918      | 0.36         |
| 2017-10-04 08:30:00.000000 | W22517      | 918      | 0.36         |
| 2017-10-02 03:30:00.000000 | W21096      | 987      | 0.64         |
| 2017-10-01 03:08:00.000000 | W23272      | 1253     | 0.73         |
| 2017-10-03 12:00:00.000000 | W22965      | 1390     | 0.82         |
| 2017-10-01 05:01:00.000000 | W23254      | 1476     | 0.91         |
| 2017-10-05 08:08:00.000000 | W00895      | 1482     | 1            |

**NOTE**

Unlike `CUME_DIST()`, `PERCENT_RANK()` always returns `0` for the first row in each window partition. Assuming ascending ordering, the lowest value has a rank of 1, so the numerator of the formula is 0: no other row ranks before it.

It's also worth noting that while `CUME_DIST()` calculates the proportion of rows with values less than or equal to a given value, `PERCENT_RANK()` considers only the rows that rank strictly before the current row and excludes the current row from the count, which is why its denominator is the number of rows minus one.

As shown in the result set, the `PERCENT_RANK()` function assigns a percentile rank to each row based on its position within the ordered result set. This percentile rank represents the **fraction of the other rows in the window partition that are ranked before the current row**. Tied rows share the same rank and therefore the same percentile rank.
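
The formula can be reproduced with `RANK()` and a row count. The sketch below, which assumes PostgreSQL, uses the twelve duration values from the `trips` table as an inline `VALUES` list; the names `durations`, `row_rank`, and `manual_percent_rank` are made up for this illustration. It also assumes that the partition has more than one row, since the manual version would otherwise divide by zero.

```sql
-- The twelve duration values from the trips table, supplied inline.
WITH durations(duration) AS (
    VALUES (1253), (1476), (650), (987), (797), (1390),
           (797), (918), (918), (918), (202), (1482)
)
SELECT duration,
       RANK() OVER (ORDER BY duration) AS row_rank,
       PERCENT_RANK() OVER (ORDER BY duration) AS percent_rank,
       -- (rank - 1) / (number of rows - 1), as in the formula above.
       (RANK() OVER (ORDER BY duration) - 1)::numeric
           / (COUNT(*) OVER () - 1) AS manual_percent_rank
  FROM durations
 ORDER BY duration;
```

The three rows with a duration of 918 all have rank 5, so both calculations give them 4 / 11, which rounds to the 0.36 shown in the result table.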

Let's see how the `PERCENT_RANK()` function allows us to solve real-world problems.

Suppose you're a data analyst at a cell phone store and want to analyze the monthly sales performance of two phone brands, Apple and Samsung, based on their revenue quartiles. You have a table named `phone_sales_by_month`, which contains the monthly sales data by phone brand and model as follows:

| sales_date | brand   | model            | quantity | unit_price |
|------------|---------|------------------|----------|------------|
| 2022-01-31 | Apple   | iPhone 13 Pro    | 50       | 999.00     |
| 2022-01-31 | Samsung | Galaxy Z Fold 4  | 30       | 650.00     |
| 2022-01-31 | Samsung | Galaxy S22 Ultra | 40       | 799.00     |
| 2022-02-28 | Apple   | iPhone 13 Pro    | 40       | 999.00     |
| 2022-02-28 | Samsung | Galaxy Z Fold 4  | 35       | 650.00     |
| 2022-03-31 | Samsung | Galaxy A53       | 25       | 415.00     |
| 2022-03-31 | Apple   | iPhone 13 Pro    | 38       | 999.00     |
| 2022-03-31 | Samsung | Galaxy Z Fold 4  | 60       | 650.00     |
| 2022-04-30 | Apple   | iPhone 13 Pro    | 30       | 999.00     |
| 2022-04-30 | Samsung | Galaxy S22 Ultra | 25       | 799.00     |
| 2022-05-31 | Samsung | Galaxy Z Fold 4  | 30       | 650.00     |
| 2022-05-31 | Apple   | iPhone 13 Pro    | 60       | 999.00     |
| 2022-06-30 | Apple   | iPhone 13 Pro    | 20       | 999.00     |
| 2022-06-30 | Samsung | Galaxy A53       | 45       | 415.00     |
| 2022-06-30 | Samsung | Galaxy Z Fold 4  | 76       | 650.00     |

To analyze the monthly sales performance, you need to calculate the percentile rank of the revenue generated by each monthly sale within its brand and use it to assign a revenue quartile to each row. The following query returns the desired results:

```sql
SELECT *,
       PERCENT_RANK() OVER(PARTITION BY brand ORDER BY quantity*unit_price DESC) AS percent_rank,
       CASE
           WHEN PERCENT_RANK() OVER(PARTITION BY brand ORDER BY quantity*unit_price DESC) < 0.25 THEN '1st'
           WHEN PERCENT_RANK() OVER(PARTITION BY brand ORDER BY quantity*unit_price DESC) < 0.50 THEN '2nd'
           WHEN PERCENT_RANK() OVER(PARTITION BY brand ORDER BY quantity*unit_price DESC) < 0.75 THEN '3rd'
           ELSE '4th'
       END AS revenue_quartiles
FROM phone_sales_by_month;
```

**NOTE**

In this case, a lower `PERCENT_RANK()` value indicates higher performance because we're ordering by revenue in descending order. The first quartile (0-25th percentile) represents the top 25% of performers, the second quartile (25-50th percentile) represents the next 25%, and so on.

| sales_date | brand   | model            | quantity | unit_price | percent_rank | revenue_quartiles |
|------------|---------|------------------|----------|------------|--------------|-------------------|
| 2022-05-31 | Apple   | iPhone 13 Pro    | 60       | 999.00     | 0            | 1st               |
| 2022-01-31 | Apple   | iPhone 13 Pro    | 50       | 999.00     | 0.2          | 1st               |
| 2022-02-28 | Apple   | iPhone 13 Pro    | 40       | 999.00     | 0.4          | 2nd               |
| 2022-03-31 | Apple   | iPhone 13 Pro    | 38       | 999.00     | 0.6          | 3rd               |
| 2022-04-30 | Apple   | iPhone 13 Pro    | 30       | 999.00     | 0.8          | 4th               |
| 2022-06-30 | Apple   | iPhone 13 Pro    | 20       | 999.00     | 1            | 4th               |
| 2022-06-30 | Samsung | Galaxy Z Fold 4  | 76       | 650.00     | 0            | 1st               |
| 2022-03-31 | Samsung | Galaxy Z Fold 4  | 60       | 650.00     | 0.125        | 1st               |
| 2022-01-31 | Samsung | Galaxy S22 Ultra | 40       | 799.00     | 0.25         | 2nd               |
| 2022-02-28 | Samsung | Galaxy Z Fold 4  | 35       | 650.00     | 0.375        | 2nd               |
| 2022-04-30 | Samsung | Galaxy S22 Ultra | 25       | 799.00     | 0.5          | 3rd               |
| 2022-01-31 | Samsung | Galaxy Z Fold 4  | 30       | 650.00     | 0.625        | 3rd               |
| 2022-05-31 | Samsung | Galaxy Z Fold 4  | 30       | 650.00     | 0.625        | 3rd               |
| 2022-06-30 | Samsung | Galaxy A53       | 45       | 415.00     | 0.875        | 4th               |
| 2022-03-31 | Samsung | Galaxy A53       | 25       | 415.00     | 1            | 4th               |

The `CASE` statement assigns a revenue quartile to each monthly sale based on its percent rank. If the percent rank is less than 0.25, the sale is assigned to the first quartile; if it's less than 0.50, the sale is assigned to the second quartile, and so on. Any monthly sale with a percent rank greater than or equal to 0.75 is assigned to the fourth quartile.

Note that a `PERCENT_RANK()` value close to 0 indicates top performance, while a value close to 1 indicates lower performance in this context.
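
The query above repeats the same `PERCENT_RANK()` expression four times. One possible alternative, sketched here under the assumption of PostgreSQL, calculates the percent rank once in a CTE and lets the `CASE` expression reference the resulting column. To stay self-contained, the sketch uses only the six Apple rows from the table above as an inline `VALUES` list; the names `apple_sales` and `ranked` are made up for this illustration.

```sql
-- The six Apple rows from phone_sales_by_month, supplied inline.
WITH apple_sales(sales_date, quantity, unit_price) AS (
    VALUES (DATE '2022-01-31', 50, 999.00),
           (DATE '2022-02-28', 40, 999.00),
           (DATE '2022-03-31', 38, 999.00),
           (DATE '2022-04-30', 30, 999.00),
           (DATE '2022-05-31', 60, 999.00),
           (DATE '2022-06-30', 20, 999.00)
),
ranked AS (
    -- Calculate the percent rank once per row.
    SELECT *,
           PERCENT_RANK() OVER (ORDER BY quantity * unit_price DESC) AS percent_rank
      FROM apple_sales
)
SELECT *,
       -- Reuse the calculated column instead of repeating the window function.
       CASE
           WHEN percent_rank < 0.25 THEN '1st'
           WHEN percent_rank < 0.50 THEN '2nd'
           WHEN percent_rank < 0.75 THEN '3rd'
           ELSE '4th'
       END AS revenue_quartiles
  FROM ranked
 ORDER BY percent_rank;
```

With a single brand there's no need for `PARTITION BY`; the percent ranks and quartiles should match the Apple rows of the result table above.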

## Instructions

The management of the cell phone store wants to analyze the store's sales performance by identifying the top-performing cell phone models each month and returning the monthly revenue generated by those models.

A model's monthly revenue is considered top-performing if its percentile rank falls into the **first revenue quartile**. To fulfill the management requirement, write a query against the `phone_sales_by_month` table.

1. Create a CTE called "monthly_sales" that performs the following steps:

    * Calculate the monthly sales revenue for each cell phone model by multiplying the `quantity` and `unit_price` columns, then sum the result. Alias this result as `total_sales`.
    
    * Group the data by `sales_date`, `brand`, and `model`.

    * Use the `PERCENT_RANK()` window function to calculate the revenue percentile rank for each model's monthly sales.
    
    * Partition the data by `sales_date` and order the partitions by the calculated total sales in descending order.

1. In the main query, select the `sales_date`, `brand`, `model`, and `total_sales` from the "monthly_sales" CTE.

1. Identify the top-performing models by keeping only the rows where the total sales' percentile rank is **less than or equal to 0.25**.

```sql
WITH monthly_sales AS (
    SELECT sales_date, brand, model,
           SUM(quantity * unit_price) AS total_sales,
           -- Named sales_percent_rank to avoid confusion with the RANK() function
           PERCENT_RANK() OVER (PARTITION BY sales_date
                                ORDER BY SUM(quantity * unit_price) DESC) AS sales_percent_rank
      FROM phone_sales_by_month
     GROUP BY sales_date, brand, model
)
SELECT sales_date, brand, model, total_sales
  FROM monthly_sales
 WHERE sales_percent_rank <= 0.25;
```

# 4) The Inverse Distribution Functions - Part 1

We already discussed rank distribution functions in the previous section, which compute the relative rank of the current row in the window partition and return a ratio between 0 and 1 that can be expressed as a percentage between 0 and 100.

Inverse distribution functions **take a given percentage and return the value** or an estimated (interpolated) value associated with that percentage within the data range.

Consider the following list of sorted values: `1, 4, 5, 41, 100, 2100, 79000`. The fiftieth percentile, or the **median**, of the given values is `41`. Since the set has seven values, the median falls at the fourth rank: three values lie below it and three lie above it.

Now consider the following list of sorted values:

`1, 4, 5, 41, 100, 2100`

Now we have six values, so we can't just take the value in the middle of the list as the median.

To find the median in this case, we need to take the average of the two middle values. The middle two values in this case are 5 and 41, so we add them together and divide by two to get (5 + 41) / 2 = 23. So the median of this list is `23`, which is an interpolated value between the two middle values.

Before discussing the two inverse distribution functions, let's discuss a new SQL clause, `WITHIN GROUP`.

`WITHIN GROUP` is a clause in which we specify an ordering expression, enabling aggregate and analytic functions to operate on a sorted group of rows rather than on the unsorted groups created by the `GROUP BY` clause.

**NOTE**

The `WITHIN GROUP` clause is part of the SQL standard, which introduced it together with the inverse distribution functions in SQL:2003. It's supported by various SQL database management systems, such as PostgreSQL, Oracle, and SQL Server, although the exact syntax and the set of supported functions differ between systems.

----------

Rank distribution functions are window functions that assign a percentile rank to each row in a window partition. Inverse distribution functions, however, require a specified percentile as input and an ordering specification within a group, and they return a single result per group, so they are used as grouped (aggregate) functions.

The basic syntax of the `WITHIN GROUP` clause is as follows:

```sql
function_name() WITHIN GROUP (ORDER BY column_expression [ ASC | DESC ])
```
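
As a quick check of the two median examples above, the sketch below applies `PERCENTILE_CONT(0.5)` with `WITHIN GROUP` to both lists, supplied inline as `VALUES`. It assumes PostgreSQL; the alias `t(v)` and the column names are made up for this illustration.

```sql
-- Seven values: the median is the middle (fourth) value, 41.
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY v) AS median_of_seven
  FROM (VALUES (1), (4), (5), (41), (100), (2100), (79000)) AS t(v);

-- Six values: the median is interpolated between 5 and 41, giving 23.
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY v) AS median_of_six
  FROM (VALUES (1), (4), (5), (41), (100), (2100)) AS t(v);
```

The order of the rows in the `VALUES` list doesn't matter, because the `ORDER BY` inside `WITHIN GROUP` sorts them before the percentile is calculated.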

# 5) The Inverse Distribution Functions - Part 2

There are two inverse distribution functions, `PERCENTILE_DISC()` and `PERCENTILE_CONT()`.

The `PERCENTILE_DISC()` (percentile discrete) and `PERCENTILE_CONT()` (percentile continuous) functions take the requested percentile as a fraction between 0 and 1 and find the value at that position within a sorted group of rows.

According to the PostgreSQL documentation, the two functions belong to a group of functions called **Ordered-Set Aggregate Functions**, and the syntax is as follows:

```sql
PERCENTILE_CONT(percent) WITHIN GROUP(ORDER BY expression)

PERCENTILE_DISC(percent) WITHIN GROUP(ORDER BY expression)

```

**NOTE**

Although the `PERCENTILE_DISC()` and `PERCENTILE_CONT()` functions are related to window functions and share some similarities with them, they're not window functions themselves. They fall under the category of ordered-set aggregate functions, which also operate on a set of rows with a specified ordering but have different syntax and usage patterns. Like any other aggregate, they can't be called directly in the `WHERE` clause of an SQL query; to filter rows by a percentile, we must first perform the calculation in a subquery or a common table expression (CTE) and then reference the result in the `WHERE` clause.

----

Let's see how we use the percentile functions in practice and their differences.

Earlier, we explored the `phone_sales_quantity_by_month` table, which contains the quantities of phones sold in Q1 and Q2 2022 as follows:

| sales_date | brand   | quantity |
|------------|---------|----------|
| 2022-01-31 | Apple   | 110      |
| 2022-01-31 | Samsung | 117      |
| 2022-02-28 | Samsung | 75       |
| 2022-02-28 | Apple   | 60       |
| 2022-03-31 | Apple   | 85       |
| 2022-03-31 | Samsung | 86       |
| 2022-04-30 | Apple   | 134      |
| 2022-04-30 | Samsung | 124      |
| 2022-05-31 | Samsung | 80       |
| 2022-05-31 | Apple   | 90       |
| 2022-06-30 | Apple   | 100      |
| 2022-06-30 | Samsung | 89       |

Let's use the percentile functions to calculate the **median** of sold phone quantities.

First, we'll try the `PERCENTILE_CONT()` function.

```sql
SELECT PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY quantity) AS "Median of Quantity"
  FROM phone_sales_quantity_by_month;
```

Since the table contains an even number of rows (12 rows), **there's no single middle value** in the quantity column. Therefore, as its name implies, the `PERCENTILE_CONT()` function returns the interpolated value between the two values in the middle of the sorted column, 89 and 90. For the median, this interpolated value is the average of those two values, resulting in **89.5**.

Now, let's try the `PERCENTILE_DISC()` function.

```sql
SELECT PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY quantity) AS "Median of Quantity"
  FROM phone_sales_quantity_by_month;
```

| Median of Quantity |
|--------------------|
| 89                 |

As the result shows, the `PERCENTILE_DISC()` function, as its name implies, always returns a **discrete value that exists in the dataset** instead of interpolating. More precisely, it returns the first value in the sorted column whose position in the ordering equals or exceeds the requested fraction.

Therefore, `PERCENTILE_DISC(0.50)` returns 89: it's the sixth of the twelve sorted quantity values, so half of the values are less than or equal to it.
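
Because both functions are aggregates, they also work with `GROUP BY`. The sketch below, which assumes PostgreSQL, calculates both medians per brand from the twelve rows of the table above, supplied inline as `VALUES`; the names `sales`, `median_cont`, and `median_disc` are made up for this illustration.

```sql
-- The twelve rows of phone_sales_quantity_by_month, supplied inline.
WITH sales(sales_date, brand, quantity) AS (
    VALUES (DATE '2022-01-31', 'Apple',   110),
           (DATE '2022-01-31', 'Samsung', 117),
           (DATE '2022-02-28', 'Samsung',  75),
           (DATE '2022-02-28', 'Apple',    60),
           (DATE '2022-03-31', 'Apple',    85),
           (DATE '2022-03-31', 'Samsung',  86),
           (DATE '2022-04-30', 'Apple',   134),
           (DATE '2022-04-30', 'Samsung', 124),
           (DATE '2022-05-31', 'Samsung',  80),
           (DATE '2022-05-31', 'Apple',    90),
           (DATE '2022-06-30', 'Apple',   100),
           (DATE '2022-06-30', 'Samsung',  89)
)
SELECT brand,
       -- Interpolates between the two middle values of each brand.
       PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY quantity) AS median_cont,
       -- Returns an existing value from each brand's sorted quantities.
       PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY quantity) AS median_disc
  FROM sales
 GROUP BY brand
 ORDER BY brand;
```

Each brand has six rows. For Apple, the two middle values are 90 and 100, so `median_cont` is 95 and `median_disc` is 90. For Samsung, the two middle values are 86 and 89, so `median_cont` is 87.5 and `median_disc` is 86.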

## Instructions

Let's assume the cell phone store wants to make informed decisions about inventory and marketing plans for next year's high-demand periods by identifying which months had high sales volumes. To do this, it's essential to **identify the months when sales volumes are in the top 25%** of all sales volumes.

Write a query against the `phone_sales_quantity_by_month` table to retrieve those months when the sales volume is greater than or equal to the 75th percentile of quantity.

1. Use a subquery to calculate the 75th percentile value of the `quantity` column.

    * Utilize the `PERCENTILE_DISC()` function with a percentile value of 0.75.
    
    * Use the `WITHIN GROUP` clause to sort the quantity values in ascending order before calculating the percentile value.

1. Select all columns from the `phone_sales_quantity_by_month` table in the main query.

1. Filter the main query results by comparing the `quantity` column with the calculated **75th percentile value**, selecting rows where the quantity is greater than or equal to the percentile value.

```sql
SELECT *
  FROM phone_sales_quantity_by_month
 -- Compare the quantity column with the 75th percentile (the top 25% threshold)
 WHERE quantity >= (SELECT PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY quantity) AS percentile
                      FROM phone_sales_quantity_by_month);
```

# 6) Solving a Real-World Problem with the Distribution Window Functions

Assume the bike-sharing company will analyze the customer data to identify the most popular bike stations and the most popular routes by focusing on members. The result of the analysis allows the company to optimize bike availability at the popular rental stations and improve customer experience and retention.

You're asked to write an SQL query to return the most frequent start and end station combinations together with the percentile rank of their rental counts by customer type.

As the company wants to allocate its resources to the routes where members rent bikes most frequently, **you need to identify the top one percent of the start and end station combinations** where the members rent bikes.

Before writing the SQL query, it's essential to address one more clause related to window functions. Although it may be tempting to jump straight into writing code, understanding this clause will enhance your ability to use window functions effectively in real-world scenarios.

Let's explore this clause before diving into the code.

**The WINDOW Clause**

The SQL standard `WINDOW` clause lets us define a named window once and reference it by name after `OVER`, which helps us write neater queries and avoid duplication when multiple window functions share the same window definition.

The general syntax of the `WINDOW` clause is as follows:

```sql
WINDOW window_name AS 
(
    [partition_definition] 
    [order_definition] 
    [frame_definition] 
)
```

The `WINDOW` clause is placed after the `GROUP BY` and `HAVING` clauses (if they are present) and before the `ORDER BY` clause in a `SELECT` query. It's also possible to define multiple named windows in a single `WINDOW` clause by separating them with commas.
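
As a minimal sketch of these rules, the query below defines two named windows in one `WINDOW` clause and reuses one of them for two different window functions. It assumes PostgreSQL. The `employees` rows are hypothetical values made up for this illustration, and the window names `dept` and `dept_by_salary` are also assumptions.

```sql
-- Hypothetical employees data, made up for illustration only.
WITH employees(name, department, salary) AS (
    VALUES ('Ana',   'Sales',       4000),
           ('Ben',   'Sales',       5000),
           ('Chloe', 'Engineering', 7000),
           ('Dev',   'Engineering', 9000)
)
SELECT name, department, salary,
       AVG(salary) OVER dept AS dept_avg_salary,
       -- Both rank distribution functions share the same named window.
       PERCENT_RANK() OVER dept_by_salary AS salary_percent_rank,
       CUME_DIST() OVER dept_by_salary AS salary_cume_dist
  FROM employees
WINDOW dept AS (PARTITION BY department),
       dept_by_salary AS (PARTITION BY department ORDER BY salary)
 ORDER BY department, salary;
```

Without the `WINDOW` clause, the `PARTITION BY department ORDER BY salary` definition would have to be repeated inside each `OVER()` clause.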

For example, the query below returns every employee row together with the average salary of the employee's department.

```sql
SELECT *,
       AVG(salary) OVER win1
  FROM employees
WINDOW win1 AS (
       PARTITION BY department
);
```

## Instructions

To address the above requirement, write a query against the `tbl_bikeshare` table.

1. Write a CTE to calculate the percentile rank of each start and end station combination for each member type using the `PERCENT_RANK()` window function, rounded to **four decimal** places. Continuing within the CTE:

    * Group the rows in the table by `start_station`, `end_station`, and `member_type` using the `GROUP BY` clause, so the `COUNT()` function can be used to count the number of rentals for each combination.

    * Use the `WINDOW` clause to make the query neat and easy to follow.

    * Alias the new percentile rank column as `rental_percentile`.

    * Order your CTE result set by `rental_percentile` in descending order.

1. Select all columns from the CTE in the main query, including the start and end stations, the member type, and the percentile rank.

1. Filter the results to include only rentals made by members and start and end station combinations with a percentile rank greater than 0.99.

```sql
WITH rental_combination AS (
    SELECT start_station, end_station, member_type,
           COUNT(*) AS count_combination,  -- rentals per start station, end station, and member type
           ROUND((PERCENT_RANK() OVER rental_per)::numeric, 4) AS rental_percentile  -- percentile rank of the count
      FROM tbl_bikeshare
     GROUP BY start_station, end_station, member_type  -- one row per station combination and member type
    WINDOW rental_per AS (PARTITION BY member_type  -- rank the counts separately for each member type
                          ORDER BY COUNT(*) ASC)
     ORDER BY rental_percentile DESC
)
SELECT *
  FROM rental_combination
 WHERE rental_percentile > 0.99
   AND member_type = 'Member';
```