# Olist's Net Promoter Score (NPS) 🔥

The **Net Promoter Score (NPS)** of a service is derived from customers' answers to the following question:

> How likely is it that you would recommend our company/product/service to a friend or colleague?

For a service rated between 1 and 5 stars, like Olist, we can **classify customers into three categories** based on their answers:
- ✅ **Promoters**: customers who answered with a score of 5
- 😴 **Passives**: customers who answered with a score of 4
- 😡 **Detractors**: customers who answered with a score between 1 and 3 (inclusive)

<br>

👉 NPS is computed by subtracting the percentage of customers who are **detractors** from the percentage of customers who are **promoters**.

> NPS  
= % Promoters - % Detractors   
= (# Promoters - # Detractors) / # Reviews  
= (# 5 stars - # <4 stars) / # Reviews
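
As a quick worked example of the formula above, here is a minimal sketch using made-up star counts (the numbers are hypothetical and are not Olist's):

```python
# Hypothetical star counts, for illustration only (not Olist data)
star_counts = {1: 10, 2: 5, 3: 5, 4: 20, 5: 60}

n_reviews = sum(star_counts.values())                   # 100 reviews in total
n_promoters = star_counts[5]                            # 60 five-star reviews
n_detractors = sum(star_counts[s] for s in (1, 2, 3))   # 20 reviews below 4 stars

toy_nps = (n_promoters - n_detractors) / n_reviews
print(f"NPS = {toy_nps:.0%}")  # NPS = 40%
```

Note that the 20 passive (4-star) reviews still count in the denominator, even though they add nothing to the numerator.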

### Import modules

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [2]:
root_path = os.path.join(os.getcwd(),'..')
if root_path not in sys.path:
    sys.path.append(root_path)

from utils.data import Olist
olist = Olist()
data = olist.get_data()

## Computing the Overall NPS Score of Olist

In [2]:
# Assumption: the Order class is defined in utils/order.py, next to utils/data.py
from utils.order import Order

orders = Order().get_training_data()


❓ Create a function that converts `review_score` into `nps_class`. `nps_class` should be a **classification** depending on the `review_score`, so there are 3 possibilities:

- `review_score` is **5** 👉 `nps_class` is **1** (promoter)
- `review_score` is **4** 👉 `nps_class` is **0** (passive)
- `review_score` is **3** or less 👉 `nps_class` is **-1** (detractor)

In [3]:
def promoter_score(review_score):
    if review_score == 5:
        return 1
    elif review_score == 4:
        return 0
    return -1


In [6]:
nps = orders['review_score'].map(promoter_score).mean()
print(f'NPS = {nps*100:.1f}%')


NPS = 38.1%


💡 Let's try to rewrite this function into a single line of code that achieves the same result 😏

There are **several** ways to do it! Let's look at some of them, then compare their execution times to that of our function to see which one is more efficient ⏱️

Two general principles when it comes to programming/coding are:
- `KISS`: **K**eep **I**t **S**imple and **S**mart
- `DRY`: **D**on't **R**epeat **Y**ourself 😉

<details>
    <summary>💡 Hint</summary>

Use the following methods and use `%time` to compare their execution times:
- `.apply()` with the function you wrote above
- `.map()` or `.apply()` with a `lambda` function
- `.loc[]` with boolean indexing
- `np.select()` with matching conditions

</details>    

In [8]:
# YOUR CODE HERE
orders['review_score'].apply(lambda x: 1 if x == 5 else (0 if x == 4 else -1)).head()


0    0
1    0
2    1
3    1
4    1
Name: review_score, dtype: int64

In [9]:
# Create boolean indexing masks
promoter = orders.review_score == 5
passive = orders.review_score == 4
detractor = orders.review_score < 4

orders.loc[promoter, 'promoter_score'] = 1
orders.loc[passive, 'promoter_score'] = 0
orders.loc[detractor, 'promoter_score'] = -1

orders['promoter_score'].head()


0    0.0
1    0.0
2    1.0
3    1.0
4    1.0
Name: promoter_score, dtype: float64

In [10]:
orders['review_score'].map({5: 1, 4: 0, 3: -1, 2: -1, 1: -1}).head()


0    0
1    0
2    1
3    1
4    1
Name: review_score, dtype: int64

In [11]:
%%time
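# Even more concisely with np.select(): for scores of 4 or 5, `review_score - 4`
# yields 0 or 1; every other score falls back to the default, -1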
orders['promoter_class'] = np.select([orders.review_score >= 4], [orders.review_score - 4], -1)


CPU times: user 7.62 ms, sys: 662 µs, total: 8.28 ms
Wall time: 6.68 ms
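
To compare approaches outside a notebook, where `%time` is not available, the standard-library `timeit` module can be used. The sketch below is only an illustration: it uses randomly generated scores as a stand-in for `orders['review_score']`, and it spells out `np.select()` with one condition per class rather than the arithmetic shortcut.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical review scores standing in for orders['review_score']
rng = np.random.default_rng(0)
scores = pd.Series(rng.integers(1, 6, size=10_000), name="review_score")


def with_map():
    # Dictionary lookup for each of the five possible scores
    return scores.map({5: 1, 4: 0, 3: -1, 2: -1, 1: -1})


def with_select():
    # One condition per class; anything unmatched falls back to -1
    classes = np.select([scores == 5, scores == 4], [1, 0], default=-1)
    return pd.Series(classes, index=scores.index, name=scores.name)


# Both approaches should produce the same classes
assert np.array_equal(with_map().to_numpy(), with_select().to_numpy())

for fn in (with_map, with_select):
    seconds = timeit.timeit(fn, number=20)
    print(f"{fn.__name__}: {seconds / 20 * 1e3:.2f} ms per call")
```

Timings depend on the machine and on the pandas and NumPy versions, so treat the printed numbers as relative rather than absolute.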


**A Note About `.apply()`**

Consider the following examples:

```python
df.apply(lambda col: col.max(), axis = 0)
df.apply(lambda row: row['A'] + row['B'], axis = 1)
```

These operations look similar because they both use `.apply()`, but one is much slower than the other. Pandas DataFrames are stored essentially column by column, a **column-major** layout (read more [here](https://en.wikipedia.org/wiki/Row-_and_column-major_order)), so column-wise operations are generally faster than row-wise ones. The second example above uses `axis=1`, making it a row-wise operation: pandas has to assemble each row into a new Series and call the Python function once per row. That access pattern is better suited to **row-major** layouts, which is the default for NumPy arrays.

For small amounts of data, this difference is irrelevant, but it can become significant on large datasets. When you need to compute something for every row of a big DataFrame, vectorized approaches such as `.loc[]` with boolean masks or `np.select()` will usually run much faster than `.apply(..., axis=1)`. `np.apply_along_axis()` avoids building a Series per row but still calls a Python function for each one, so expect a smaller gain from it.

It's always good to understand how your data is stored before you access it!
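
The gap between row-wise and column-wise work can be measured directly. This is a minimal sketch on a hypothetical two-column DataFrame (the names `A` and `B` simply mirror the example above), comparing a row-wise `.apply()` with the equivalent vectorized column addition:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical two-column frame, for illustration only
rng = np.random.default_rng(42)
df = pd.DataFrame({"A": rng.random(5_000), "B": rng.random(5_000)})


def row_wise():
    # Calls the lambda once per row, receiving each row as a Series
    return df.apply(lambda row: row["A"] + row["B"], axis=1)


def vectorized():
    # Adds the two columns in a single array operation
    return df["A"] + df["B"]


# Same result, very different cost
assert np.allclose(row_wise().to_numpy(), vectorized().to_numpy())

for fn in (row_wise, vectorized):
    seconds = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {seconds / 3 * 1e3:.2f} ms per call")
```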

👇 Now that you have the different promoter scores, you can compute **Olist's NPS**.

In [12]:
# YOUR CODE HERE
nps = orders['review_score'].map(promoter_score).mean()
print(f'NPS = {nps*100:.1f}%')


NPS = 38.1%


## NPS per Customer State

👇 Here is the part of Olist's DB schema that is relevant for this section, to give you an overview.

<img src="https://wagon-public-datasets.s3-eu-west-1.amazonaws.com/04-Decision-Science/02-Statistical-Inference/olist_schema.png" width=750>

### What is the average review score per state?

❓ First, create the dataset required for the computation.

In [15]:
# YOUR CODE HERE
dataset = data['orders'].merge(data['customers'], on='customer_id') \
                        .merge(data['order_reviews'], on='order_id')
dataset.head()


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state,review_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,7c396fd4830fd04220f754e42b4e5bff,3149,sao paulo,SP,a54f0611adc9ed256b57ede6b6eb5114,4,,"Não testei o produto ainda, mas ele veio corre...",2017-10-11 00:00:00,2017-10-12 03:43:48
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00,af07308b275d755c9edb36a90c618231,47813,barreiras,BA,8d5266042046a06655c8db133d120ba5,4,Muito boa a loja,Muito bom o produto.,2018-08-08 00:00:00,2018-08-08 18:37:50
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00,3a653a41f6f9fc3d2a113cf8398680e8,75265,vianopolis,GO,e73b67b67587f7644d5bd1a52deb1b01,5,,,2018-08-18 00:00:00,2018-08-22 19:07:58
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00,7c142cf63193a1473d2e66489a9ae977,59296,sao goncalo do amarante,RN,359d03e676b3c069f62cadba8dd3f6e8,5,,O produto foi exatamente o que eu esperava e e...,2017-12-03 00:00:00,2017-12-05 19:21:58
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00,72632f0f9dd73dfee390c9b22eb56dd6,9195,santo andre,SP,e50934924e227544ba8246aeb3770dd4,5,,,2018-02-17 00:00:00,2018-02-18 13:02:51
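
When chaining merges like this, it can help to state the expected key relationships so pandas raises an error if they do not hold. The sketch below uses tiny hypothetical tables that only borrow the Olist column names, and it assumes that an order can have more than one review:

```python
import pandas as pd

# Tiny hypothetical tables mirroring the Olist column names used above
orders_toy = pd.DataFrame({"order_id": ["o1", "o2", "o3"],
                           "customer_id": ["c1", "c2", "c3"]})
customers_toy = pd.DataFrame({"customer_id": ["c1", "c2", "c3"],
                              "customer_state": ["SP", "SP", "RJ"]})
reviews_toy = pd.DataFrame({"order_id": ["o1", "o2", "o2", "o3"],
                            "review_score": [5, 4, 1, 3]})

toy_dataset = (
    orders_toy
    # customer_id must be unique in the customers table
    .merge(customers_toy, on="customer_id", validate="many_to_one")
    # assumption: one order may have several reviews, so rows can multiply here
    .merge(reviews_toy, on="order_id", validate="one_to_many")
)
print(toy_dataset.shape)  # (4, 4)
```

The `validate` argument does not change the result; it only checks the keys and raises `pandas.errors.MergeError` when the stated relationship is violated.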


👉 Now, we can aggregate this dataset per `customer_state` using any aggregation method of our choice :)

❓ Let's start with the average review score: compute the average `review_score` per `customer_state`.

*Hints:* try to tackle this question using three different methods:
- with `.mean()`
- then with `.apply()`
- and finally with `.agg()`

In [17]:
# YOUR CODE HERE
dataset.groupby(by='customer_state')['review_score'].mean().head()


customer_state
AC    4.049383
AL    3.751208
AM    4.183673
AP    4.194030
BA    3.860888
Name: review_score, dtype: float64

In [21]:
# YOUR CODE HERE
dataset.groupby(by='customer_state')['review_score'].apply('mean').head()


customer_state
AC    4.049383
AL    3.751208
AM    4.183673
AP    4.194030
BA    3.860888
Name: review_score, dtype: float64

In [25]:
# YOUR CODE HERE
dataset.groupby(by='customer_state').agg({'review_score': ['mean', 'max'], 'customer_zip_code_prefix': pd.Series.count}).head()


| customer_state | review_score (mean) | review_score (max) | customer_zip_code_prefix (count) |
|---|---|---|---|
| AC | 4.049383 | 5 | 81 |
| AL | 3.751208 | 5 | 414 |
| AM | 4.183673 | 5 | 147 |
| AP | 4.19403 | 5 | 67 |
| BA | 3.860888 | 5 | 3357 |


🤩 `.agg()` is much more flexible than the other methods, push it further!

In [None]:
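# YOUR CODE HERE
# Store the aggregation, then inspect and select from its MultiIndex columns
agg_test = dataset.groupby(by='customer_state').agg({'review_score': ['mean', 'max']})
agg_test.columns
agg_test.loc[:, ('review_score', 'max')].head()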
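
`.agg()` also accepts keyword arguments of the form `new_name=(column, function)`, known as named aggregation, which gives flat column names and allows custom functions. Here is a minimal sketch on a hypothetical toy frame (the output names `mean_score`, `share_5_stars`, etc. are made up for the example):

```python
import pandas as pd

# Hypothetical reviews, for illustration only (not Olist data)
toy = pd.DataFrame({
    "customer_state": ["SP", "SP", "RJ", "RJ", "RJ"],
    "review_score": [5, 4, 1, 5, 3],
})

summary = toy.groupby("customer_state").agg(
    mean_score=("review_score", "mean"),
    max_score=("review_score", "max"),
    n_reviews=("review_score", "count"),
    # custom aggregation: share of 5-star reviews within each group
    share_5_stars=("review_score", lambda s: (s == 5).mean()),
)
print(summary)
```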


### NPS per State

❓ Now, it is time to create a 🔥 **custom aggregation function** to compute the `NPS per customer_state` directly.

1️⃣ Create your `nps` function

2️⃣ Debug it by calling `breakpoint()` inside your function, so you can see clearly which objects you are manipulating

<br>

💡 *PS:* always exit your debugger **cleanly** by typing `exit` inside the debugging session; otherwise you will have to restart your Notebook!

In [29]:
# YOUR CODE HERE
def nps(series):
    # Sum the +1 / 0 / -1 classes, then divide by the number of reviews
    return series.map(promoter_score).sum() / series.count()


👉 Now, use your `nps` function to compute the `NPS per customer_state`.

In [30]:
# YOUR CODE HERE
dataset.groupby(by='customer_state')['review_score'].apply(nps)

Again, instead of using this function, try to do the same task in one line of code. Remember the `KISS` principle? 😉

In [32]:
# YOUR CODE HERE
# Number of reviews per state, i.e. the denominator of each state's NPS
dataset.groupby(by='customer_state')['review_score'].apply('count')


customer_state
AC       81
AL      414
AM      147
AP       67
BA     3357
CE     1329
DF     2148
ES     2016
GO     2024
MA      746
MG    11625
MS      724
MT      903
PA      968
PB      531
PE     1646
PI      491
PR     5038
RJ    12765
RN      482
RO      252
RR       46
RS     5483
SC     3623
SE      349
SP    41690
TO      279
Name: review_score, dtype: int64
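
The cell above returns the number of reviews per state, which is only the denominator of the NPS. One possible one-line version of the full computation maps each score to its class and averages the classes per state. The sketch below runs on a hypothetical toy frame rather than on the Olist data:

```python
import pandas as pd

# Hypothetical reviews, for illustration only (not Olist data)
toy = pd.DataFrame({
    "customer_state": ["SP", "SP", "SP", "SP", "RJ", "RJ"],
    "review_score": [5, 5, 4, 2, 5, 1],
})

# Map each score to +1 / 0 / -1, then average within each state
nps_per_state = (
    toy["review_score"]
    .map({5: 1, 4: 0, 3: -1, 2: -1, 1: -1})
    .groupby(toy["customer_state"])
    .mean()
)
print(nps_per_state)
```

Here `SP` gets (1 + 1 + 0 - 1) / 4 = 0.25 and `RJ` gets (1 - 1) / 2 = 0.0. On the notebook's data, the same chain would start from `dataset['review_score']` and group by `dataset['customer_state']`.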

# Cheat Sheet for `map`, `apply`, `applymap` and `groupby`

```python
# MAP (for Series)
series.map(function)
series.map(mapping_dict)

# APPLY (for DataFrame)
df.apply(lambda col: col.max(), axis = 0)     # default axis
df.apply(lambda row: row['A'] + row['B'], axis = 1)

df.applymap(my_funct_for_indiv_elements)
df.applymap(lambda x: '%.2f' % x)
```
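
One caveat on `applymap`: it is deprecated since pandas 2.1 in favor of `DataFrame.map`, which does the same element-wise job. The following sketch, on a hypothetical toy frame, picks whichever method the installed pandas version provides:

```python
import pandas as pd

# Hypothetical numeric frame, for illustration only
df = pd.DataFrame({"A": [1.0, 2.5], "B": [3.14159, 4.0]})

# DataFrame.map exists from pandas 2.1 onwards; older versions only have applymap
elementwise = df.map if hasattr(df, "map") else df.applymap

# Format every individual cell as a string with two decimals
formatted = elementwise(lambda x: f"{x:.2f}")
print(formatted)
```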

```python
## GROUPBY
group = df.groupby('col_A')

group.mean()
group.apply(np.mean)
# Keys are unique names of columns other than the grouping column
group.agg({
    'col_B': ['mean', np.sum],
    'col_C': my_custom_sum,
    'col_D': lambda s: my_custom_sum(s)
})

group.apply(custom_mean_function)
```

[Introduction to Pandas' `apply`, `applymap` and `map` - Towards Data Science](https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff)