# Statistics

In [None]:
# Run this cell to load the tdquiz magic command, which runs queries and returns the results.
%load_ext tdquiz

# Test tdquiz magic command
%tdquiz SELECT USER, SESSION, CURRENT_TIMESTAMP

<h3>Increase rate from the past year</h3><span class="tag" style="font-size: smaller; background-color: #dddddd; color: #222222">arithmetic-function</span> <span class="tag" style="font-size: smaller; background-color: #dddddd; color: #222222">logarithm</span> <span class="tag" style="font-size: smaller; background-color: #dddddd; color: #222222">math</span>
<br>
There are several definitions of the rate of increase. The following are often used:
<ol>
<li> y<sub>t</sub> / y<sub>t-1</sub> - 1
<li> 1 - y<sub>t-1</sub> / y<sub>t</sub>
<li> log<sub>e</sub>(y<sub>t</sub>) - log<sub>e</sub>(y<sub>t-1</sub>)
</ol>
The first one, perhaps the most popular, calculates the change as a ratio of the previous value.
The second one takes the current value as the basis for comparison instead.
The third one uses the natural logarithm; it approximates the other two and is symmetric in the two values (swapping them only flips the sign).
<br />
These definitions take similar values as long as the change is not drastic.
<br />
For TeraShirt's monthly store revenues, calculate the three increase rates from 2021 to 2022, comparing the same month in each year.
Omit any month for which the revenue is missing in either year.
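
The three definitions above can be compared numerically. The following is a minimal Python sketch, not part of the exercise answer; the function name `increase_rates` and the revenue figures are made up for illustration.

```python
import math

def increase_rates(prev, curr):
    """Return the three increase rates for a previous and a current value."""
    rate1 = curr / prev - 1                  # change relative to the previous value
    rate2 = 1 - prev / curr                  # change relative to the current value
    rate3 = math.log(curr) - math.log(prev)  # difference of natural logarithms
    return rate1, rate2, rate3

# Hypothetical revenues, chosen only for illustration.
small = increase_rates(100.0, 105.0)   # a 5% rise
large = increase_rates(100.0, 200.0)   # a doubling

print([round(r, 4) for r in small])
print([round(r, 4) for r in large])
```

For the 5% rise the three rates are roughly 0.05, 0.048 and 0.049, while for the doubling they spread out to 1.0, 0.5 and about 0.69. The logarithmic rate always lies between the other two.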

<details><summary>Answer</summary><pre style="margin: 1em 0; padding: 1em; border-radius: 5px; background-color: #25292f; white-space: pre-wrap;"><code style="background-color: #25292f; color: #ffffff">with tmp AS (
SELECT
  EXTRACT(MONTH FROM purchase_date) AS sales_month,
  EXTRACT(YEAR FROM purchase_date) AS sales_year,
  SUM(sales_value) AS revenue
FROM
  TeraShirt.store_sales
WHERE
  sales_year IN (2021, 2022)
GROUP BY 1,2
)
SELECT
  COALESCE(a.sales_month, b.sales_month) AS "month",
  1.00 * b.revenue / a.revenue - 1 AS inc_rate1,
  1 - 1.00 * a.revenue / b.revenue AS inc_rate2,
  LN(b.revenue) - LN(a.revenue) AS inc_rate3
FROM 
  tmp AS a
  INNER JOIN tmp AS b
    ON a.sales_month = b.sales_month
       AND a.sales_year = 2021 AND b.sales_year = 2022
ORDER BY 1</code></pre></details>

In [None]:
%%tdquiz
/* Write your query below */






<h3>Frequent buyers</h3><span class="tag" style="font-size: smaller; background-color: #dddddd; color: #222222">median</span> <span class="tag" style="font-size: smaller; background-color: #dddddd; color: #222222">aggregation</span> <span class="tag" style="font-size: smaller; background-color: #dddddd; color: #222222">window-function</span>
<br>
TeraShirt company is planning to implement different marketing plans by grouping the registered customers by their purchase frequencies.
To this end, they have decided to explore the number of days between consecutive purchases by customers. 
Note that we count only one purchase per day for each customer.
<br />
To compare purchase frequencies across customers, derive for each customer the number of purchases and the average and median number of days between purchases.
Finally, display the results in increasing order of the median (in case of ties, sort by the average and then by the customer ID).
The output columns are the customer ID, the customer name (last name and first name joined by a space), the number of purchases, the average days and the median days.
<br />
Those with smaller median values are more frequent buyers. Is there a large difference among customers?
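
To make the quantities concrete, here is a small Python sketch of the same calculation for a single customer. It is only an illustration under stated assumptions: the helper `purchase_gaps` and the dates are hypothetical and do not come from the TeraShirt data.

```python
from datetime import date
from statistics import mean, median

def purchase_gaps(purchase_dates):
    """Days between consecutive distinct purchase dates of one customer."""
    days = sorted(set(purchase_dates))   # count only one purchase per day
    return [(b - a).days for a, b in zip(days, days[1:])]

# Hypothetical purchase dates for a single customer.
dates = [date(2022, 1, 3), date(2022, 1, 3), date(2022, 1, 10),
         date(2022, 1, 12), date(2022, 2, 11)]

gaps = purchase_gaps(dates)
print(gaps, mean(gaps), median(gaps))
```

The two purchases on January 3 collapse into one, so five rows give four distinct days and three gaps: 7, 2 and 30 days. The average is 13 days but the median is 7, which shows why the median is less affected by a single long pause.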

<details><summary>Answer</summary><pre style="margin: 1em 0; padding: 1em; border-radius: 5px; background-color: #25292f; white-space: pre-wrap;"><code style="background-color: #25292f; color: #ffffff">WITH tmp AS (
  SELECT
    purchase_date,
    purchase_date - LAG(purchase_date) OVER (PARTITION BY customer_id ORDER BY purchase_date) AS duration,
    customer_id
  FROM
    ( SELECT DISTINCT purchase_date, customer_id FROM terashirt.store_sales WHERE customer_id IS NOT NULL ) AS a
)
SELECT
  b.customer_id,
  b.last_name || ' ' || b.first_name AS customer_name,
  count(*) AS n_visit,
  avg(duration) AS average_days,
  median(duration) AS median_days
FROM
  tmp
  INNER JOIN terashirt.customer AS b ON tmp.customer_id = b.customer_id
GROUP BY 1,2
HAVING count(*) >= 10
ORDER BY 5,4,1</code></pre></details>

In [None]:
%%tdquiz
/* Write your query below */






<h3>Comparison of duration between purchases</h3><span class="tag" style="font-size: smaller; background-color: #dddddd; color: #222222">survival-analysis</span> <span class="tag" style="font-size: smaller; background-color: #dddddd; color: #222222">cumulative-distribution</span>
<br>
TeraShirt company is visualizing the distribution of the duration between purchases for each customer.
Here, we take the two customers with IDs 2 and 54 as examples.
<br />
For each customer, calculate the duration between two consecutive purchases (purchases on the same day count only once),
and then, for each duration, calculate the ratio of that customer's durations that are smaller than or equal to it.
Return the ratio as a FLOAT value.
<br />
The output will be the customer ID, the duration between purchases, and the ratio of durations smaller than or equal to that duration,
sorted by customer ID and duration.
<br />
Optional: Using the output table, create a graph for each customer where the x-axis is the duration and the y-axis is the ratio.
In statistics, this is called the cumulative distribution function.
How would you compare the purchase durations of the two customers from the graph?
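
As an illustration of the ratio described above, the following Python sketch builds the empirical cumulative distribution from a list of durations. The function name `empirical_cdf` and the sample durations are assumptions made for this example, not TeraShirt data.

```python
from collections import Counter

def empirical_cdf(durations):
    """Return (duration, ratio) pairs, where ratio is the share of
    observations less than or equal to that duration."""
    freq = Counter(durations)
    total = len(durations)
    running = 0
    points = []
    for d in sorted(freq):                   # ascending order of duration
        running += freq[d]                   # running count of durations <= d
        points.append((d, running / total))  # true division yields a float
    return points

# Hypothetical durations in days.
print(empirical_cdf([2, 7, 7, 30]))
# -> [(2, 0.25), (7, 0.75), (30, 1.0)]
```

The ratio never decreases and reaches 1.0 at the largest duration. When two customers are plotted together, the curve that rises earlier belongs to the one who tends to return sooner.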

<details><summary>Answer</summary><pre style="margin: 1em 0; padding: 1em; border-radius: 5px; background-color: #25292f; white-space: pre-wrap;"><code style="background-color: #25292f; color: #ffffff">with tmp AS (
  SELECT DISTINCT customer_id, purchase_date
  FROM TeraShirt.store_sales
  WHERE customer_id IN (2, 54)
)
,tmp2 AS (
  SELECT
    customer_id,
    purchase_date - LAG(purchase_date) OVER (
    PARTITION BY customer_id ORDER BY purchase_date
  ) AS duration
  FROM tmp
)
,tmp3 AS (
  SELECT customer_id, duration, COUNT(*) AS freq
  FROM tmp2
  WHERE duration IS NOT NULL
  GROUP BY 1,2
)
SELECT
  customer_id, duration,
  CAST(1 AS FLOAT) * SUM(freq) OVER (
    PARTITION BY customer_id
    ORDER BY duration
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) / SUM(freq) OVER (PARTITION BY customer_id)
  AS cum_ratio
FROM
 tmp3
ORDER BY 1,2</code></pre></details>

In [None]:
%%tdquiz
/* Write your query below */




