# Analyzing Net Promoter Score (NPS) data with SQL

## Goals (2 min)

In this case, we will extend your prior SQL query knowledge to more advanced querying. Specifically, we will focus on queries that involve computations across tables. We will also leverage ```PostgreSQL```'s advanced aggregating functions to perform analysis directly using SQL.

During the case, you will accomplish a non-trivial statistical computation directly in the database. While this will be done on your local machine, this computation can also be done in the cloud.

## Introduction (5 min)

**Business Context.** You are a data scientist at a new but fast-growing startup. The startup released its first product 12 months ago and has been tracking Net Promoter Score (NPS) over its growing customer base since the product's launch.

The team assumes that the NPS score is correlated to the product stability and feature-completeness and that the product has been getting more stable and complete over time. They also realize that there have been some hiccups along the way, and they assume that NPS has therefore fluctuated up and down.

**Business Problem.** The startup wants you to investigate the data and answer the following question: **"Has our NPS improved over time? And has our average NPS decreased in specific periods over the last 12 months?"**

**Analytical Context.** In this case, you will be working with a dataset that the startup has been storing since its inception. They decided to use a ```PostgreSQL``` database because it offered advanced statistical functions. As it is a startup, they only have enough funds to give you a computationally-limited machine; however, their database server is very good. You'll have to connect to the ```PostgreSQL``` database and have the database run the resource-intensive queries.

Specifically, you will: (1) get familiarized with what NPS is and some of its properties; (2) set up a PostgreSQL database on your local machine and import the given dataset into it; (3) use advanced SQL features to calculate NPS and NPS statistics.

## Understanding the Net Promoter Score (NPS) (10 min)

NPS is a metric to measure customer satisfaction. You've probably seen pop-ups online, or received surveys via email, asking you "Would you recommend [product] to a friend or family member?" and giving you the option to respond with a number between 0 and 10. That's someone collecting information to calculate their NPS.

![nps Example Survey](images/nps-example-survey.png)

The basic idea is simple - customers who respond with high ratings are more likely to promote your product to other potential customers. Customers who give low ratings are unhappy and are unlikely to help you grow your customer base. If you ask enough people at different time periods, you can track customer satisfaction over time and see how this correlates to product development and other aspects of your business that are within your control. 

NPS categorizes users into three groups based on the ratings that they leave. This is done as follows:

1. Users who leave a rating of 0 - 6 are regarded as "detractors"
2. Users who leave a rating of 7 or 8 are regarded as "passives"
3. Users who leave a rating of 9 or 10 are regarded as "promoters"

The final NPS score for a given period is calculated as the percentage of total users who are promoters minus the percentage of total users who are detractors. This means that an NPS score can be anything from -100 to 100.
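To make the definition concrete, here is a minimal Python sketch of the calculation. The function names `classify` and `nps` and the sample ratings are illustrative assumptions, not part of the case's dataset or code.

```python
def classify(rating):
    """Map a 0-10 survey rating to its NPS group."""
    if not 0 <= rating <= 10:
        raise ValueError("rating must be between 0 and 10")
    if rating >= 9:
        return "promoter"
    if rating >= 7:
        return "passive"
    return "detractor"


def nps(ratings):
    """Percentage of promoters minus percentage of detractors."""
    groups = [classify(r) for r in ratings]
    promoters = groups.count("promoter")
    detractors = groups.count("detractor")
    return 100 * (promoters - detractors) / len(groups)


# Hypothetical responses: 2 promoters, 1 passive, 1 detractor
print(nps([10, 9, 8, 3]))  # 25.0
```

Passives do not appear in the numerator, but they still count toward the total, so they pull the score toward zero.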

### Connecting to ```PostgreSQL``` (5 min)

We will be using `SQLAlchemy` to perform all the SQL commands. However, this time we will be connecting to a PostgreSQL database. To do this, use the following cell which contains boilerplate code that will handle establishing the connection.

**Note:** If you are comfortable using the command line and the ```psql``` utility, you are free to do this instead. This will also be taught in a future case to connect to the cloud.

In [None]:
import pandas as pd
import os

from sqlalchemy import create_engine, text

#maximum number of rows to display
pd.options.display.max_rows = 20

engine=create_engine('postgresql://localhost/postgres', max_overflow=20)

def runQuery(sql):
    result = engine.connect().execution_options(isolation_level="AUTOCOMMIT").execute((text(sql)))
    return pd.DataFrame(result.fetchall(), columns=result.keys())

def setup():
    customer_file = os.path.abspath("./customer.csv")
    score_file = os.path.abspath("./score.csv")

    return runQuery("""
    CREATE TABLE customer (id serial not null, created_at date, is_premier boolean, is_spam boolean, CONSTRAINT customer_pkey PRIMARY KEY (id));
    CREATE TABLE score (id serial not null, customer_id integer references customer(id), created_at date, score integer, CONSTRAINT scores_pkey PRIMARY KEY (id));
    COPY customer FROM '""" + customer_file + """' WITH (format csv, header true, delimiter ',');
    COPY score FROM '""" + score_file + """' WITH (format csv, header true, delimiter ',');
    SELECT * FROM customer LIMIT(5);
    """)
    
def cleanup():
    runQuery("""
    DROP TABLE customer CASCADE;
    DROP TABLE score CASCADE;
    SELECT 0 WHERE FALSE; -- prevents SQLAlchemy from throwing an error
    """)

## Loading the data (8 min)

Now that we have a PostgreSQL database set up, we will load in some data. In this particular case, we'll use the code at [this repository](https://github.com/sixhobbits/nps-sample-data) to generate a large sample of fake NPS data and push it to a PostgreSQL instance on your local machine. To set everything up, we just need to run the ```setup()``` command. At the end, we will run the ```cleanup()``` command to return your PostgreSQL database to its original state. If the following command works, you should see the first 5 rows of the ```customer``` table.

**Note:** If you receive an error about permissions, then there is an issue with access to the file. You will need to make sure the file and its directory are readable by other users. For Unix users, use `chmod`. Windows users can follow the instructions [here](https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-R2-and-2008/cc754344(v=ws.11)?redirectedfrom=MSDN).

In [None]:
setup()

### Data format (5 min)

The tables that we just imported into PostgreSQL have the following column details:

1. `customer`
      * **id** INT, primary key of the table
      * **created_at** DATE, date the record was created
      * **is_premier** BOOLEAN, if the customer is a premier customer
      * **is_spam** BOOLEAN, if a customer is a spam customer
      

2. `score`, scores of the various surveys
      * **id** INT, primary key of the table
      * **customer_id** FOREIGN KEY, which customer completed the survey
      * **created_at** DATE, when the customer did the survey
      * **score** INT, the score of the survey

The database can be visualised as below:

![Database Schema](./images/database_schema.png)

The important context is that we are imagining a scenario where:

* We have been running a new company for around one year
* The product has gone through different stages of feature improvement and stability but has overall shown growth and improvement
* Every day, new customers join and both new and old customers may or may not leave us a score between 0 - 10 to rate how likely they are to recommend our product to family and friends
* At the start and at some key points during the year, the product is unstable or lacking features and this affects the customer rating

## Analyzing our NPS data using SQL (55 min)

Now we can proceed to the fun part. We have NPS scores left by a large number of customers over the past year, and we want to see how these scores change over time.

We only have raw data - numbers between 0 and 10 inclusive – so we'll use SQL to group this data in different ways and transform it into NPS data. If you remember how to define NPS from the first section, you can probably work out that the main things we need to do are:

1. Break down our scores per customer for any given time period (here, we will look at this per week)
2. Divide customers into promoters, passives or detractors, based on the scores they have left in that week
3. Calculate the NPS per week and look at how this value changes week-by-week

### Counting customers and scores (10 min)

We saw how many customers and scores we had when we did the import step above. However, in a real-world setting, you would have gathered this data slowly, over time, so let's start by counting our customers and our survey responses (the `score` table), and by looking at how many surveys each customer responds to.

#### Counting customers

In [None]:
runQuery("""
SELECT COUNT(*) FROM customer;""")

We have nearly 200k customers, which is not bad for a product that's been running for one year!

#### Counting scores

In [None]:
runQuery("""
SELECT COUNT(*) FROM score;""")

And we have over 1.5 million survey responses. That's just over 8 responses per customer if we assume an equal distribution. Let's use SQL to look at that.

### Exercise 1: (5 min)

Write a SQL query that outputs a table showing the 10 customers with the highest number of responses and their total response count, in descending order (customer with most responses at the top).

**Answer.** One possible solution is given below:

In [None]:
runQuery("""
SELECT customer_id, COUNT(score.id) AS cnt FROM score
INNER JOIN customer ON customer_id = customer.id
GROUP BY customer_id ORDER BY cnt DESC
LIMIT 10;""")

We can see that the top three places have customer IDs `31`, `928` and `4271`, each having left 38 survey responses. 

You might also be used to doing SQL `JOIN`s using commas and a `WHERE` clause as a shortcut. The above command is equivalent to the following one, but the earlier version is preferable in most contexts as it is more explicit:

In [None]:
runQuery("""
SELECT customer_id, COUNT(score.id) AS cnt FROM score, customer
WHERE customer_id = customer.id
GROUP BY customer_id ORDER BY cnt DESC
LIMIT 10;""")

Sorting the same query in ascending order instead shows that there are at least 10 customers who have left only a single response. Let's do a 'count of counts' query to get a better idea of how many responses most customers leave. We want to count how many customers have left exactly $x$ responses.

### Nested queries (15 min)

Before proceeding to the next step, let's take a look at something called **nested queries** or **subqueries**. Nested queries are SQL queries performed inside another SQL query. Usually it involves using the result of the inner query in the `WHERE` clause of the outer query:

![Nested Queries](./images/nested_queries.png)

The innermost query is executed first and the results of the inner queries are fed as parameters to the outer query. For example, in the current database, in order to find the scores provided by premier customers, we need 2 queries - one to find the IDs of all the premier customers in our database, and another to filter the scores given by that particular list of customers. This process can be implemented as follows:

```
SELECT * FROM score
WHERE customer_id IN (
    SELECT id FROM customer WHERE is_premier = 't'
);
```

This first executes the inner query to get the IDs of all the premier customers. That list of IDs is passed as a parameter to the outer query, in order to filter the scores provided by the particular IDs.
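The same pattern can be tried without a PostgreSQL server. The sketch below uses Python's built-in `sqlite3` module as a stand-in, with a handful of made-up rows; note that SQLite stores booleans as `0`/`1`, so the filter is written `is_premier = 1` rather than `is_premier = 't'`.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here; the toy rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, is_premier BOOLEAN);
    CREATE TABLE score (id INTEGER PRIMARY KEY, customer_id INTEGER, score INTEGER);
    INSERT INTO customer VALUES (1, 1), (2, 0), (3, 1);
    INSERT INTO score VALUES (1, 1, 9), (2, 2, 4), (3, 3, 7), (4, 2, 10);
""")

# The inner SELECT yields the premier customer IDs (1 and 3);
# the outer SELECT keeps only the scores left by those customers.
premier_scores = conn.execute("""
    SELECT id, customer_id, score FROM score
    WHERE customer_id IN (
        SELECT id FROM customer WHERE is_premier = 1
    )
    ORDER BY id
""").fetchall()

print(premier_scores)  # [(1, 1, 9), (3, 3, 7)]
```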

### Exercise 2: (10 min)

Write a SQL query that outputs a table showing how many customers leave $x$ responses for any given integer $x$. Sort this table in descending order ($x$ with highest number of customers leaving $x$ responses at the top).

**Answer.** One possible solution is given below:

In [None]:
runQuery("""
SELECT cnt, COUNT(cnt) as count_of_count FROM
(
    SELECT customer_id, count(score.id) AS cnt FROM score
    INNER JOIN customer ON customer_id = customer.id
    GROUP BY customer_id
) a
GROUP BY cnt
ORDER BY count_of_count DESC
LIMIT 100;""")

Notice in the query above we have given the intermediate query an **alias**, which comes immediately after the closing parenthesis. In this case, we have chosen the alias `a`. It is a common convention to use aliases `a`, `b`, `c`, etc. as a shorthand if you are primarily interested only in the final result.

From our previous query, we already know that all the values have to fall between 1 and 38, so there can be a maximum of 38 rows returned in this query. Therefore there is no real need to add a `LIMIT` clause, but we add a `LIMIT 100` anyway. This is a good habit in case you make a wrong assumption about the likely size of your output, in order to prevent a situation where you accidentally try to pull thousands or millions of rows from a remote server.

We can see that most customers leave between 2 and 10 responses, so the maximum of 38 is an outlier. A fair number of people only leave one response.
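If the two levels of grouping feel abstract, the same 'count of counts' logic can be sketched in plain Python with `collections.Counter`. The list of customer IDs below is hypothetical and only illustrates the shape of the computation.

```python
from collections import Counter

# Hypothetical customer_id column of the score table, one entry per response
customer_ids = [1, 1, 1, 2, 2, 3, 4, 4, 5]

# Inner query: responses per customer (GROUP BY customer_id)
responses_per_customer = Counter(customer_ids)

# Outer query: number of customers per response count (GROUP BY cnt)
count_of_counts = Counter(responses_per_customer.values())

# Sort like ORDER BY count_of_count DESC
for cnt, n_customers in count_of_counts.most_common():
    print(cnt, n_customers)
```

Here two customers left one response, two left two responses, and one left three.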

### Average scores per week (10 min)

However, we still have not looked at how scores are *changing*. Let's average all scores in each week and see how the scores go up and down over time:

In [None]:
runQuery("""
SELECT TO_CHAR(score.created_at, 'IYYY-IW') AS week, AVG(score) AS avg_score
FROM score
GROUP BY week
ORDER BY week ASC
LIMIT 100;""")

Again, we did not need to add a limit clause as we know there will only be 52 rows (the number of weeks in a year, which is the span of our dataset), but we do anyway for good measure.

We can see that the scores start low and generally trend up over time, although they go down again around week 36 (not shown above). We use the [ISO Week](https://en.wikipedia.org/wiki/ISO_week_date) through PostgreSQL's `TO_CHAR` function to break down each of our dates into a specific week number and average the scores per week. 
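For intuition about the `'IYYY-IW'` label, the following sketch reproduces it in Python with `date.isocalendar()`. The helper name `iso_week_label` and the example dates are assumptions chosen for illustration.

```python
from datetime import date


def iso_week_label(d):
    """Mimic TO_CHAR(d, 'IYYY-IW'): ISO year and zero-padded ISO week."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"{iso_year:04d}-{iso_week:02d}"


# Arbitrary example dates around a year boundary
print(iso_week_label(date(2021, 1, 3)))  # 2020-53
print(iso_week_label(date(2021, 1, 4)))  # 2021-01
```

The first example shows why the query pairs the ISO week with the ISO year (`IYYY`) rather than the calendar year: 3 January 2021 belongs to the last ISO week of 2020.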

There are a couple issues with the above query, though:

1. The `AVG` function shows a lot of decimal places by default, which makes the data more difficult to read
2. Many customers leave a different number of responses and some might leave more than one response per week

A good compromise is to calculate the average score per customer per week, then average all of these to get an average score across all customers per week. That way, every customer counts once per week regardless of how many responses they left. Let's do this and round the result to make our data easier to read.
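A small Python sketch shows why the two averages can differ. The customers and scores below are invented for illustration.

```python
from statistics import mean

# Hypothetical responses in one week: customer_id -> scores left that week
week_scores = {
    "A": [10, 10, 10, 10],  # one very active, very happy customer
    "B": [2],
}

# Plain average over all responses: customer A counts four times
all_scores = [s for scores in week_scores.values() for s in scores]
plain_avg = mean(all_scores)

# Average of per-customer averages: every customer counts once
per_customer_avg = mean(mean(scores) for scores in week_scores.values())

print(plain_avg, per_customer_avg)
```

The plain average is 8.4, while the average of per-customer averages is 6, because the single active customer no longer dominates the week.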

### Exercise 3: (7 min)

Write a query to compute the average score across all customers per week, rounding off to two decimal places.

**Hint:** Use the `ROUND()` function, which takes two arguments: the quantity you are rounding, and how many decimals you are rounding off to.

**Answer.** One possible solution is given below:

In [None]:
runQuery("""
SELECT week, ROUND(AVG(avg_week_score),2) as avg_score FROM
(
    SELECT TO_CHAR(score.created_at, 'IYYY-IW') AS week, customer_id, AVG(score) as avg_week_score FROM score
    GROUP BY week, customer_id
) a
GROUP BY week
ORDER BY week
LIMIT 100;""")

### Classifying our customers as promoters, passives, or detractors (10 min)

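Now, let's proceed to classifying our customers so we can calculate the NPS per week. We use a similar nested `SELECT` (two deep this time!) and a `CASE` statement. We classify each customer's weekly *average* score, which need not be a whole number: anything larger than 8 is a promoter, otherwise anything larger than 6 is a passive, and everything else is a detractor. For whole-number scores this matches the definition above (9 or 10, then 7 or 8), while a fractional average such as 8.5 falls into the higher class: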

In [None]:
runQuery("""
SELECT * FROM
(
    SELECT CASE
        WHEN avg_week_score > 8 THEN 'promoter'
        WHEN avg_week_score > 6 THEN 'passive'
        ELSE 'detractor'
    END AS nps_class, week FROM
    (
        SELECT TO_CHAR(score.created_at, 'IYYY-IW') AS week, customer_id, AVG(score) as avg_week_score FROM score
        GROUP BY week, customer_id
    ) a
) b
limit 10;""")

This is closer to what we need, but not very useful in its current form. We can confirm that there are still nearly a million rows by using another `COUNT`:

In [None]:
runQuery("""
SELECT count(*) FROM
(
    SELECT CASE
        WHEN avg_week_score > 8 THEN 'promoter'
        WHEN avg_week_score > 6 THEN 'passive'
        ELSE 'detractor'
    END AS nps_class, week FROM
    (
        SELECT TO_CHAR(score.created_at, 'IYYY-IW') AS week, customer_id, AVG(score) as avg_week_score FROM score
        GROUP BY week, customer_id
    ) a
) b
LIMIT 10;""")

Note that we also had to add another intermediate alias `b` to our SQL code, as we have yet another level of nested `SELECT`.

Now that we've broken our customers into specific categories, we want to count them. It's useful to "pivot" this data so that we can see the count of each class of people as a separate column. In a spreadsheet program like Microsoft Excel or Google Sheets, we would think of this as a pivot table, and there are plugins for PostgreSQL to allow you to use it in a similar way. In our case, though, we can count the number of each class each week using some more `CASE` statements and the `SUM` function as follows:

In [None]:
runQuery("""
SELECT week,
SUM(CASE WHEN nps_class = 'promoter' THEN 1 ELSE 0 END) AS "promoter",
SUM(CASE WHEN nps_class = 'passive' THEN 1 ELSE 0 END) AS "passive",
SUM(CASE WHEN nps_class = 'detractor' THEN 1 ELSE 0 END) AS "detractor",
COUNT(*) AS "total" FROM
(
    SELECT CASE
        WHEN avg_week_score > 8 THEN 'promoter'
        WHEN avg_week_score > 6 THEN 'passive'
        ELSE 'detractor'
    END AS nps_class, week FROM
    (
        SELECT TO_CHAR(score.created_at, 'IYYY-IW') AS week, customer_id, AVG(score) as avg_week_score FROM score
        GROUP BY week, customer_id
    ) a
) b
GROUP BY week
ORDER BY week
limit 100;""")
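The `SUM(CASE ...)` pivot used above can be tried on a tiny table. This sketch again uses Python's built-in `sqlite3` as a stand-in for PostgreSQL; the table name `classified` and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE classified (week TEXT, nps_class TEXT);
    INSERT INTO classified VALUES
        ('2020-01', 'promoter'), ('2020-01', 'detractor'), ('2020-01', 'detractor'),
        ('2020-02', 'promoter'), ('2020-02', 'passive'), ('2020-02', 'promoter');
""")

# Each CASE yields 1 for rows in its class and 0 otherwise,
# so SUM counts the rows of that class within each week.
pivot = conn.execute("""
    SELECT week,
        SUM(CASE WHEN nps_class = 'promoter' THEN 1 ELSE 0 END) AS promoter,
        SUM(CASE WHEN nps_class = 'passive' THEN 1 ELSE 0 END) AS passive,
        SUM(CASE WHEN nps_class = 'detractor' THEN 1 ELSE 0 END) AS detractor,
        COUNT(*) AS total
    FROM classified
    GROUP BY week
    ORDER BY week
""").fetchall()

print(pivot)  # [('2020-01', 1, 0, 2, 3), ('2020-02', 2, 1, 0, 3)]
```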

### Calculating NPS per week (10 min)

We now have all the pieces in place to calculate our NPS. To do this, we will have to use a *third* nested `SELECT` and yet another table alias `c`.

### Exercise 4: (10 min)

Given the above guidance, write the query to compute NPS per week.

**Answer.** One possible solution is given below:

In [None]:
runQuery("""
SELECT *, ROUND(((CAST(promoter AS DECIMAL) / total) - (CAST(detractor AS DECIMAL) / total)) * 100, 0) AS nps FROM
(
    SELECT week,
    SUM(CASE WHEN nps_class = 'promoter' THEN 1 ELSE 0 END) AS "promoter",
    SUM(CASE WHEN nps_class = 'passive' THEN 1 ELSE 0 END) AS "passive",
    SUM(CASE WHEN nps_class = 'detractor' THEN 1 ELSE 0 END) AS "detractor",
        COUNT(*) AS "total" FROM

    (
        SELECT CASE
        WHEN avg_week_score > 8 THEN 'promoter'
        WHEN avg_week_score > 6 THEN 'passive'
        ELSE 'detractor'
        END AS nps_class, week FROM
        (
            SELECT TO_CHAR(score.created_at, 'IYYY-IW') AS week, customer_id, AVG(score) as avg_week_score FROM score
            GROUP BY week, customer_id
        ) a
    ) b
GROUP BY week
ORDER BY week
) c
limit 100;""")

That first line is not pretty, but it works! We can now see the NPS, correctly rounded, for any given week.

## Calculating how NPS has changed over time in SQL (40 min)

Now that we have NPS calculated, we want to see how the NPS has changed through time. Specifically, we are interested in which months the NPS decreased. The team releases new features on the first day of every month, so we want to see how the features may have changed the NPS for that month. To do this we are going to use the more advanced statistical aggregate functions offered by ```PostgreSQL```.

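The first thing we need to do is change the above query so that it includes the month, so that we can ```GROUP BY``` the month to do our analysis. This is easily accomplished by adding a new column to our top-level ```SELECT``` using ```CEIL``` and ```SUBSTRING```. The ```SUBSTRING``` call extracts the week number from the ```IYYY-IW``` label (its third argument is a length, so any value of 2 or more returns the two week digits), and multiplying by 12/52 and rounding up maps that week to an approximate month number: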

In [None]:
runQuery("""
SELECT *, ROUND(((CAST(promoter AS DECIMAL) / total) - (CAST(detractor AS DECIMAL) / total)) * 100, 0) AS nps, CEIL(CAST(SUBSTRING(week, 6, 8) AS DECIMAL)*12/52) AS month FROM
(
    SELECT week,
    SUM(CASE WHEN nps_class = 'promoter' THEN 1 ELSE 0 END) AS "promoter",
    SUM(CASE WHEN nps_class = 'passive' THEN 1 ELSE 0 END) AS "passive",
    SUM(CASE WHEN nps_class = 'detractor' THEN 1 ELSE 0 END) AS "detractor",
        COUNT(*) AS "total" FROM

    (
        SELECT CASE
        WHEN avg_week_score > 8 THEN 'promoter'
        WHEN avg_week_score > 6 THEN 'passive'
        ELSE 'detractor'
        END AS nps_class, week FROM
        (
            SELECT TO_CHAR(score.created_at, 'IYYY-IW') AS week, customer_id, AVG(score) as avg_week_score FROM score
            GROUP BY week, customer_id
        ) a
    ) b
GROUP BY week
ORDER BY week
) c
limit 100;""")
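To see what the week-to-month mapping does, here is a small Python sketch of the same arithmetic. The helper name `approx_month` is an assumption, and integer arithmetic is used in place of `CEIL` to avoid floating-point rounding.

```python
def approx_month(iso_week):
    """Mimic CEIL(week * 12 / 52) using integer arithmetic."""
    return (iso_week * 12 + 51) // 52


for week in (1, 4, 5, 36, 52, 53):
    print(week, approx_month(week))
```

Weeks 1 to 4 map to month 1, week 36 maps to month 9, and week 52 maps to month 12. The mapping is only approximate: it does not follow calendar month boundaries, and in an ISO year that has a week 53, that week would map to 13.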

Now that we have the month included, let's calculate some statistics about how the NPS changes within each month. The team at the startup wants to know in which months the mean NPS for the month ended up more than 1.5 standard deviations below the first weekly NPS of that month.
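As a rough illustration of this criterion, the sketch below computes the number of standard deviations for one month in Python. The weekly NPS values are hypothetical; `statistics.stdev` is the sample standard deviation, which is also what PostgreSQL's `STDDEV` computes.

```python
from statistics import mean, stdev

# Hypothetical weekly NPS values for one month, first week first
weekly_nps = [40, 30, 28, 26]

first_nps = weekly_nps[0]
nps_avg = mean(weekly_nps)
nps_std = stdev(weekly_nps)  # sample standard deviation (n - 1 denominator)

# Negative when the monthly mean is below the first week's NPS
num_std = (nps_avg - first_nps) / nps_std
print(round(num_std, 2))  # -1.45
```

With these made-up numbers the monthly mean sits about 1.45 standard deviations below the first week, so this month would narrowly miss the 1.5 threshold.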

### Exercise 5: (15 min)

Write two SQL queries that will output (1) the first NPS score of the month and (2) the mean NPS of that month and its standard deviation.

**Hint:** the aggregate functions of ```PostgreSQL``` can be found [here](https://www.postgresql.org/docs/9.5/functions-aggregate.html). You might also need to use a ```JOIN```.

**Answer.** One possible solution is given below:

In [None]:
runQuery("""
SELECT stats.month, AVG(stats.nps) as nps_avg, STDDEV(stats.nps) as nps_std FROM
(
    SELECT *, ROUND(((CAST(promoter AS DECIMAL) / total) - (CAST(detractor AS DECIMAL) / total)) * 100, 0) AS nps, CEIL(CAST(SUBSTRING(week, 6, 8) AS DECIMAL)*12/52) AS month FROM
    (
        SELECT week,
        SUM(CASE WHEN nps_class = 'promoter' THEN 1 ELSE 0 END) AS "promoter",
        SUM(CASE WHEN nps_class = 'passive' THEN 1 ELSE 0 END) AS "passive",
        SUM(CASE WHEN nps_class = 'detractor' THEN 1 ELSE 0 END) AS "detractor",
            COUNT(*) AS "total" FROM

        (
            SELECT CASE
            WHEN avg_week_score > 8 THEN 'promoter'
            WHEN avg_week_score > 6 THEN 'passive'
            ELSE 'detractor'
            END AS nps_class, week FROM
            (
                SELECT TO_CHAR(score.created_at, 'IYYY-IW') AS week, customer_id, AVG(score) as avg_week_score FROM score
                GROUP BY week, customer_id
            ) a
        ) b
    GROUP BY week
    ORDER BY week
    ) c
) stats
GROUP BY month
limit 100;""")

The below diagram shows the order in which the nested queries are executed for the example given above. In general, the innermost query is always executed first and its results are passed on to the next innermost query, etc.

![Nested Query Workflow](./images/nested_workflow.png)

In [None]:
runQuery("""
SELECT DISTINCT ON (month)
month, nps FROM
(
    SELECT *, ROUND(((CAST(promoter AS DECIMAL) / total) - (CAST(detractor AS DECIMAL) / total)) * 100, 0) AS nps, CEIL(CAST(SUBSTRING(week, 6, 8) AS DECIMAL)*12/52) AS month FROM
    (
        SELECT week,
        SUM(CASE WHEN nps_class = 'promoter' THEN 1 ELSE 0 END) AS "promoter",
        SUM(CASE WHEN nps_class = 'passive' THEN 1 ELSE 0 END) AS "passive",
        SUM(CASE WHEN nps_class = 'detractor' THEN 1 ELSE 0 END) AS "detractor",
            COUNT(*) AS "total" FROM

        (
            SELECT CASE
            WHEN avg_week_score > 8 THEN 'promoter'
            WHEN avg_week_score > 6 THEN 'passive'
            ELSE 'detractor'
            END AS nps_class, week FROM
            (
                SELECT TO_CHAR(score.created_at, 'IYYY-IW') AS week, customer_id, AVG(score) as avg_week_score FROM score
                GROUP BY week, customer_id
            ) a
        ) b
    GROUP BY week
    ORDER BY week
    ) c
) first_nps
ORDER BY month, week -- DISTINCT ON keeps the first row per month in this order
limit 100;""")

Of course, we can do the calculation manually to see the results because we only have 12 months in this example. However, we want to make this process automated. We can ```JOIN``` these two tables and do the desired calculation.

Before we do that, however, you may have noticed that these SQL statements are getting a bit unwieldy. The fact that these queries have identical inner ```SELECT```s means we can share that across both queries so that our ```JOIN``` becomes more compact. This is accomplished with a **common table expression** (CTE), written with the ```WITH ... AS``` syntax, which essentially lets you name a subquery and reuse it like a variable within a single statement:

In [None]:
runQuery("""
WITH nps_weekly AS 
( 
    SELECT *, ROUND(((CAST(promoter AS DECIMAL) / total) - (CAST(detractor AS DECIMAL) / total)) * 100, 0) AS nps, CEIL(CAST(SUBSTRING(week, 6, 8) AS DECIMAL)*12/52) AS month FROM
    (
        SELECT week,
        SUM(CASE WHEN nps_class = 'promoter' THEN 1 ELSE 0 END) AS "promoter",
        SUM(CASE WHEN nps_class = 'passive' THEN 1 ELSE 0 END) AS "passive",
        SUM(CASE WHEN nps_class = 'detractor' THEN 1 ELSE 0 END) AS "detractor",
            COUNT(*) AS "total" FROM

        (
            SELECT CASE
            WHEN avg_week_score > 8 THEN 'promoter'
            WHEN avg_week_score > 6 THEN 'passive'
            ELSE 'detractor'
            END AS nps_class, week FROM
            (
                SELECT TO_CHAR(score.created_at, 'IYYY-IW') AS week, customer_id, AVG(score) as avg_week_score FROM score
                GROUP BY week, customer_id
            ) a
        ) b
    GROUP BY week
    ORDER BY week
    ) c
)

SELECT *, (first_nps-nps_avg)/nps_std as num_std FROM
(
    (
        SELECT a.month, AVG(a.nps) as nps_avg, STDDEV(a.nps) as nps_std FROM
        (
            SELECT * FROM nps_weekly
        ) a
        GROUP BY month
    ) stats

    INNER JOIN
    (
        SELECT DISTINCT ON (month)
        month, nps as first_nps FROM nps_weekly
        ORDER BY month, week -- earliest week of each month comes first
    ) first_nps

    ON (stats.month = first_nps.month)
)
WHERE (nps_avg-first_nps)/nps_std < -1.5
""")

We can see from this result that the features introduced in month 9 may have been responsible for a drop in NPS. However, being a data scientist, you know that calculating standard deviation on 4 or 5 numbers may not be the best method. Thus, you decide that doing a regression (or line of best fit) on the weekly data is better to see if the NPS decreased or increased in a given month.
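Since the `WITH ... AS` expression is reused in the next exercise, here is a minimal sketch of the pattern on a toy table, run through Python's built-in `sqlite3` as a stand-in for PostgreSQL. The table `weekly` and its rows are invented, and because SQLite has no `DISTINCT ON`, a correlated subquery picks the first week of each month instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE weekly (month INTEGER, week TEXT, nps INTEGER);
    INSERT INTO weekly VALUES
        (1, '2020-01', 10), (1, '2020-02', 20),
        (2, '2020-05', 30), (2, '2020-06', 50);
""")

# nps_weekly is defined once and referenced by both joined subqueries.
rows = conn.execute("""
    WITH nps_weekly AS (
        SELECT month, week, nps FROM weekly
    )
    SELECT stats.month, stats.nps_avg, firsts.first_nps
    FROM (
        SELECT month, AVG(nps) AS nps_avg FROM nps_weekly GROUP BY month
    ) stats
    INNER JOIN (
        -- first week of each month, standing in for DISTINCT ON
        SELECT n.month, n.nps AS first_nps FROM nps_weekly n
        WHERE n.week = (SELECT MIN(week) FROM nps_weekly m WHERE m.month = n.month)
    ) firsts ON stats.month = firsts.month
    ORDER BY stats.month
""").fetchall()

print(rows)  # [(1, 15.0, 10), (2, 40.0, 30)]
```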

### Exercise 6: (15 min)

Using ```regr_slope(Y, X)``` and the ```nps_weekly``` subquery defined with ```WITH ... AS``` above, calculate the slope of the weekly NPS scores within each month and return the months whose slope is negative. [Here](https://www.postgresql.org/docs/9.5/functions-aggregate.html) is the documentation for ```regr_slope(Y, X)```.

**Answer.** One possible solution is given below:

In [None]:
runQuery("""
WITH nps_weekly AS 
( 
    SELECT *, ROUND(((CAST(promoter AS DECIMAL) / total) - (CAST(detractor AS DECIMAL) / total)) * 100, 0) AS nps, CEIL(CAST(SUBSTRING(week, 6, 8) AS DECIMAL)*12/52) AS month FROM
    (
        SELECT week,
        SUM(CASE WHEN nps_class = 'promoter' THEN 1 ELSE 0 END) AS "promoter",
        SUM(CASE WHEN nps_class = 'passive' THEN 1 ELSE 0 END) AS "passive",
        SUM(CASE WHEN nps_class = 'detractor' THEN 1 ELSE 0 END) AS "detractor",
            COUNT(*) AS "total" FROM

        (
            SELECT CASE
            WHEN avg_week_score > 8 THEN 'promoter'
            WHEN avg_week_score > 6 THEN 'passive'
            ELSE 'detractor'
            END AS nps_class, week FROM
            (
                SELECT TO_CHAR(score.created_at, 'IYYY-IW') AS week, customer_id, AVG(score) as avg_week_score FROM score
                GROUP BY week, customer_id
            ) a
        ) b
    GROUP BY week
    ORDER BY week
    ) c
)

SELECT * FROM
(
    SELECT month, regr_slope(nps, CAST(SUBSTRING(week, 6, 8) AS DECIMAL)) as slope FROM nps_weekly
    GROUP BY month
) slopes
WHERE slope < 0

""")

This result reaffirms that September had a strong reduction in NPS score. However, we also see that months 4, 5, and 7 have negative slopes that may warrant further research.
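To see what `regr_slope(Y, X)` computes, the following Python sketch implements the ordinary least-squares slope by hand. The week numbers and NPS values are hypothetical and are not taken from the dataset.

```python
def regr_slope(ys, xs):
    """Least-squares slope of y on x, the quantity regr_slope(Y, X) returns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical month in which NPS falls week after week
weeks = [36, 37, 38, 39]
weekly_nps = [50, 44, 41, 35]
print(regr_slope(weekly_nps, weeks))  # -4.8
```

A slope of -4.8 means the fitted line loses about 4.8 NPS points per week, which is the kind of month the `WHERE slope < 0` filter is meant to surface.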

### Cleanup

Now that we are done with the case, let's clean up your local PostgreSQL database so that it is not polluted with the tables used in this case study:

In [None]:
cleanup()

## Conclusions (5 min)

In this case, you learned about the Net Promoter Score (NPS) metric and set up a local `PostgreSQL` database. You also learned how to write complex queries in SQL that could be run directly in the database. These queries used advanced features like nested `SELECT` statements and `CASE` statements, which can be combined in intricate ways to get the results you need directly from your database. They also leveraged statistical aggregate functions that `PostgreSQL` provides, such as `regr_slope`, to do basic statistical analysis directly in the database.

We found that there was a general increase in NPS over time; however, starting in September there was a significant downturn in average NPS score. It is likely that the product encountered some significant bugs or outages during this time and going forward we should check if anything was recorded by the startup's product team to confirm this.

## Takeaways (2 min)

Although SQL is often seen as "simple" and discarded in favor of general-purpose languages like Python, basic SQL building blocks such as `SELECT`, `WHERE`, and `CASE` can be combined into sophisticated queries. Because these queries run where the data lives, they can be far more efficient than pulling the raw rows to your local machine and doing the same work in Python. Many computations that you might otherwise do locally can be done directly in the database.