# Orders

✏️ **Exercise**

Today, we will investigate the **orders** and their associated review scores.

For that purpose, we will create one single data table containing **all our orders with some engineered statistics for them as additional columns.**

👉 Our goal is to create a DataFrame with the following features:

*- it will become very handy for our modeling phase -*


| feature_name 	| type 	| description 	|
|:---	|:---:	|:---	|
| `order_id` 	| str 	| the id of the order 	|
| `wait_time` 	| float 	| the number of days between order_date and delivered_date 	|
| `expected_wait_time` 	| float 	| the number of days between order_date and estimated_delivery_date 	|
| `delay_vs_expected` 	| float 	| the number of days between the estimated delivery date and the actual delivery date if the order was delivered late, otherwise 0 	|
| `order_status` 	| str 	| the status of the order 	|
| `dim_is_five_star` 	| int 	| 1 if the order received a five-star review, 0 otherwise 	|
| `dim_is_one_star` 	| int 	| 1 if the order received a one-star review, 0 otherwise 	|
| `review_score` 	| int 	| from 1 to 5 	|
| `number_of_products` 	| int 	| number of products that the order contains 	|
| `number_of_sellers` 	| int 	| number of sellers involved in the order 	|
| `price` 	| float 	| total price of the order paid by customer 	|
| `freight_value` 	| float 	| value of the freight paid by customer 	|
| `distance_seller_customer` 	| float 	| the distance in km between customer and seller (optional) 	|
  
⚠️ Unless explicitly specified otherwise, we also want to filter out "non-delivered" orders, because we cannot compute potential delays for them.
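
A minimal sketch of such a filter on toy data. The helper name `keep_delivered` and its `is_delivered` flag are hypothetical, chosen only for illustration; they are not necessarily what `olist/order.py` should expose.

```python
import pandas as pd


def keep_delivered(orders, is_delivered=True):
    """Hypothetical helper: keep delivered orders unless is_delivered is False."""
    if is_delivered:
        # Boolean mask on the status column, then copy to avoid side effects
        return orders[orders["order_status"] == "delivered"].copy()
    return orders.copy()


# Toy data, not taken from the Olist dataset
toy_orders = pd.DataFrame({
    "order_id": ["a", "b", "c"],
    "order_status": ["delivered", "canceled", "delivered"],
})

delivered_only = keep_delivered(toy_orders)
everything = keep_delivered(toy_orders, is_delivered=False)
```

With the default flag, only orders `a` and `c` remain; passing `is_delivered=False` returns all three rows.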

❓ **Your challenge**: 

- Implement each feature as a separate method within the `Order` class available at `olist/order.py`
- Then, create a method `get_training_data()` that returns the complete DataFrame.

💡 Suggested methodology:
- Use the notebook below to write and test your code step-by-step first
- Then copy the code into `order.py` once you are certain of your code logic
- Focus on the data manipulation logic now; we will analyse the dataset visually in the next challenges

🔥 Notebook best practices (must-read) 👇

<details>
    <summary> - <i>click here</i> - </summary>

From now on, exploratory notebooks are going to become pretty long, and we strongly advise you to follow these notebook principles:
- Code your logic so that your Notebook can always be run from top to bottom without crashing (Cell --> Run All)
- Name your variables carefully 
- Use dummy names such as `tmp` or `_` for intermediary steps when you know you won't need them for long
- Clear your code and merge cells when relevant to minimize Notebook size (`Shift-M`)
- Hide your cell output if you don't need to see it anymore (double-click on the red `Out[]:` section to the left of your cell).
- Make heavy use of the Jupyter nbextensions `Collapsible Headings` and `Table of Contents` (call a TA if you can't find them)
- Use the following shortcuts 
    - `a` to insert a cell above
    - `b` to insert a cell below
    - `dd` to delete a cell
    - `esc` and `arrows` to move between cells
    - `Shift-Enter` to execute cell and move focus to the next one
    - use `Shift + Tab` when you are between method brackets e.g. `groupby()` to get the docs! Repeat a few times to open it permanently

</details>





In [0]:
# Auto reload imported modules every time a Jupyter cell is executed (handy for olist/order.py updates)
%load_ext autoreload
%autoreload 2

In [0]:
# Import usual modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime

In [2]:
# Import olist data
from olist.data import Olist
olist = Olist()
data = olist.get_data()
matching_table = olist.get_matching_table()

## Code `order.py`

In [0]:
orders = data['orders'].copy() # good practice to be sure not to modify your `data` variable

### `get_wait_time`
    Returns a DataFrame with:
           order_id, wait_time, expected_wait_time, delay_vs_expected, order_status

*- Hints -*
- Don't forget to convert dates from strings to pandas datetimes using [`pandas.to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)
- Take time to understand what python [`datetime`](https://docs.python.org/3/library/datetime.html) objects are 
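
To make these hints concrete, here is a small sketch on toy data (the column names `purchased` and `delivered` are invented for illustration and are not Olist columns). It shows that subtracting two datetime columns gives a timedelta column, which can then be turned into a float number of days, and that negative values can be floored at zero.

```python
import pandas as pd

# Toy data, not taken from the Olist dataset
toy = pd.DataFrame({
    "purchased": ["2018-01-01 08:00:00", "2018-01-05 12:00:00"],
    "delivered": ["2018-01-03 20:00:00", "2018-01-06 00:00:00"],
})

# Strings become datetime columns
toy["purchased"] = pd.to_datetime(toy["purchased"])
toy["delivered"] = pd.to_datetime(toy["delivered"])

# Subtracting two datetime columns yields a timedelta column
elapsed = toy["delivered"] - toy["purchased"]

# Dividing by a one-day Timedelta converts it to a float number of days
toy["days"] = elapsed / pd.Timedelta(days=1)

# Negative differences (e.g. early deliveries) can be floored at zero
floored = pd.Series([-3.2, 1.5]).clip(lower=0)
```

Here `toy["days"]` holds 2.5 and 0.5, and `floored` holds 0.0 and 1.5.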

🎁 We give you the pseudo-code below 👇 for this first method:

> 1. Inspect the `orders` dataframe
> 2. Filter the dataframe on `delivered orders`
> 3. Handle `datetime`
> 4. Compute `wait_time`
> 5. Compute `expected_wait_time`
> 6. Compute `delay_vs_expected`
> 7. Check the new dataframe
> 8. Once you are satisfied with your code, you can carefully copy-paste it from the notebook to `olist/order.py`

In [0]:
# Keep delivered orders only; copy so that `orders` is left untouched
delivered_orders = orders[orders['order_status']=="delivered"].copy()

# Convert the date columns from strings to datetimes
delivered_orders['order_purchase_timestamp'] = pd.to_datetime(delivered_orders['order_purchase_timestamp'], format='%Y-%m-%d %H:%M:%S')
delivered_orders['order_approved_at'] = pd.to_datetime(delivered_orders['order_approved_at'], format='%Y-%m-%d %H:%M:%S')
delivered_orders['order_delivered_carrier_date'] = pd.to_datetime(delivered_orders['order_delivered_carrier_date'], format='%Y-%m-%d %H:%M:%S')
delivered_orders['order_delivered_customer_date'] = pd.to_datetime(delivered_orders['order_delivered_customer_date'], format='%Y-%m-%d %H:%M:%S')
delivered_orders['order_estimated_delivery_date'] = pd.to_datetime(delivered_orders['order_estimated_delivery_date'], format='%Y-%m-%d')

# Actual and expected waiting times, as timedeltas counted from the purchase
delivered_orders['wait_time'] = delivered_orders['order_delivered_customer_date'] - delivered_orders['order_purchase_timestamp']
delivered_orders['expected_wait_time'] = delivered_orders['order_estimated_delivery_date'] - delivered_orders['order_purchase_timestamp']

# Delay = actual wait minus expected wait (negative when delivered early)
delivered_orders['delay_vs_expected'] = delivered_orders['wait_time'] - delivered_orders['expected_wait_time']

# Convert the three timedelta columns to a float number of days
delivered_orders['wait_time'] = delivered_orders['wait_time'] / datetime.timedelta(days=1)
delivered_orders['expected_wait_time'] = delivered_orders['expected_wait_time'] / datetime.timedelta(days=1)
delivered_orders['delay_vs_expected'] = delivered_orders['delay_vs_expected'] / datetime.timedelta(days=1)

# Early deliveries count as zero delay
delivered_orders.loc[delivered_orders['delay_vs_expected'] < 0, 'delay_vs_expected'] = 0

# Keep only the columns required for this method
delivered_orders = delivered_orders.filter(['order_id', 'order_status','wait_time', 'expected_wait_time', 'delay_vs_expected'])
delivered_orders

Unnamed: 0,order_id,order_status,wait_time,expected_wait_time,delay_vs_expected
0,e481f51cbdc54678b7cc49136f2d6af7,delivered,8.436574,15.544063,0.0
1,53cdb2fc8bc7dce0b6741e2150273451,delivered,13.782037,19.137766,0.0
2,47770eb9100c2d0c44946d9cf07ec65d,delivered,9.394213,26.639711,0.0
3,949d5b44dbf5de918fe9c16f97b45f8a,delivered,13.208750,26.188819,0.0
4,ad21c59c0840e6cb83a9ceb5573f8159,delivered,2.873877,12.112049,0.0
...,...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,delivered,8.218009,18.587442,0.0
99437,63943bddc261676b46f01ca7ac2f7bd8,delivered,22.193727,23.459051,0.0
99438,83c1379a015df1e13d02aae0204711ab,delivered,24.859421,30.384225,0.0
99439,11c177c8e97725db2631073c19f07b62,delivered,17.086424,37.105243,0.0


👀 Check the dataframe you've just created. <br/> 

💪 When your code works, commit it to `olist/order.py` <br/>

🙏 Now, test it by running the following cell 👇 

In [6]:
# Test your code here
from olist.order import Order
Order().get_wait_time()

Unnamed: 0,order_id,order_status,wait_time,expected_wait_time,delay_vs_expected
0,e481f51cbdc54678b7cc49136f2d6af7,delivered,8.436574,15.544063,0.0
1,53cdb2fc8bc7dce0b6741e2150273451,delivered,13.782037,19.137766,0.0
2,47770eb9100c2d0c44946d9cf07ec65d,delivered,9.394213,26.639711,0.0
3,949d5b44dbf5de918fe9c16f97b45f8a,delivered,13.208750,26.188819,0.0
4,ad21c59c0840e6cb83a9ceb5573f8159,delivered,2.873877,12.112049,0.0
...,...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,delivered,8.218009,18.587442,0.0
99437,63943bddc261676b46f01ca7ac2f7bd8,delivered,22.193727,23.459051,0.0
99438,83c1379a015df1e13d02aae0204711ab,delivered,24.859421,30.384225,0.0
99439,11c177c8e97725db2631073c19f07b62,delivered,17.086424,37.105243,0.0


### `get_review_score`
     Returns a DataFrame with:
        order_id, dim_is_five_star, dim_is_one_star, review_score

👉 Load the `reviews`

In [52]:
reviews = data['order_reviews'].copy()
reviews

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53
...,...,...,...,...,...,...,...
99219,574ed12dd733e5fa530cfd4bbf39d7c9,2a8c23fee101d4d5662fa670396eb8da,5,,,2018-07-07 00:00:00,2018-07-14 17:18:30
99220,f3897127253a9592a73be9bdfdf4ed7a,22ec9f0669f784db00fa86d035cf8602,5,,,2017-12-09 00:00:00,2017-12-11 20:06:42
99221,b3de70c89b1510c4cd3d0649fd302472,55d4004744368f5571d1f590031933e4,5,,"Excelente mochila, entrega super rápida. Super...",2018-03-22 00:00:00,2018-03-23 09:10:43
99222,1adeb9d84d72fe4e337617733eb85149,7725825d039fc1f0ceb7635e3f7d9206,4,,,2018-07-01 00:00:00,2018-07-02 12:59:13


👉 Let's create two functions, `dim_five_star` and `dim_one_star`.  
    We will apply them element-wise to the `review_score` column in the next cell below.

In [54]:
def dim_five_star(d):
    if d == 5:
        return 1
    return 0

def dim_one_star(d):
    if d == 1:
        return 1
    return 0

👉 Use these functions to create two binary (0/1) features `dim_is_five_star` and `dim_is_one_star`

In [57]:
reviews['dim_is_five_star'] = reviews['review_score'].apply(dim_five_star)
reviews['dim_is_one_star'] = reviews['review_score'].apply(dim_one_star)
reviews

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,dim_is_five_star,dim_is_one_star
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59,0,0
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13,1,0
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24,1,0
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,1,0
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,1,0
...,...,...,...,...,...,...,...,...,...
99219,574ed12dd733e5fa530cfd4bbf39d7c9,2a8c23fee101d4d5662fa670396eb8da,5,,,2018-07-07 00:00:00,2018-07-14 17:18:30,1,0
99220,f3897127253a9592a73be9bdfdf4ed7a,22ec9f0669f784db00fa86d035cf8602,5,,,2017-12-09 00:00:00,2017-12-11 20:06:42,1,0
99221,b3de70c89b1510c4cd3d0649fd302472,55d4004744368f5571d1f590031933e4,5,,"Excelente mochila, entrega super rápida. Super...",2018-03-22 00:00:00,2018-03-23 09:10:43,1,0
99222,1adeb9d84d72fe4e337617733eb85149,7725825d039fc1f0ceb7635e3f7d9206,4,,,2018-07-01 00:00:00,2018-07-02 12:59:13,0,0
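
As an alternative to applying a Python function element-wise, the same 0/1 columns can be obtained with a vectorized comparison. The sketch below uses a toy `toy_reviews` DataFrame (invented for illustration) rather than the real reviews table.

```python
import pandas as pd

# Toy data, not taken from the Olist dataset
toy_reviews = pd.DataFrame({"review_score": [5, 1, 4, 5]})

# The comparison yields a boolean Series; astype(int) maps True/False to 1/0
toy_reviews["dim_is_five_star"] = (toy_reviews["review_score"] == 5).astype(int)
toy_reviews["dim_is_one_star"] = (toy_reviews["review_score"] == 1).astype(int)
```

Dropping `.astype(int)` would leave the columns as `True`/`False` instead of 1/0, which matters if the expected type is `int`.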


Once again, 

👀 Check the dataframe you've just created. <br/> 

💪 When your code works, commit it to `olist/order.py` <br/>

🙏 Now, test it by running the following cell 👇 

In [13]:
# Test your code here
from olist.order import Order
Order().get_review_score()

Unnamed: 0,order_id,dim_is_five_star,dim_is_one_star,review_score
0,73fc7af87114b39712e6da79b0a377eb,False,False,4
1,a548910a1c6147796b98fdf73dbeba33,True,False,5
2,f9e4b658b201a9f2ecdecbb34bed034b,True,False,5
3,658677c97b385a9be170737859d3511b,True,False,5
4,8e6bfb81e283fa7e4f11123a3fb894f1,True,False,5
...,...,...,...,...
99219,2a8c23fee101d4d5662fa670396eb8da,True,False,5
99220,22ec9f0669f784db00fa86d035cf8602,True,False,5
99221,55d4004744368f5571d1f590031933e4,True,False,5
99222,7725825d039fc1f0ceb7635e3f7d9206,False,False,4


#### Check your code

In [14]:
from nbresult import ChallengeResult

result = ChallengeResult('reviews',
    dim_five_star=dim_five_star(5),
    dim_not_five_star=dim_five_star(3),
    dim_one_star=dim_one_star(1),
    dim_not_one_star=dim_one_star(2)
)
result.write()
print(result.check())

platform linux -- Python 3.8.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/useradd/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/useradd/code/LucaVanTichelen/data-challenges/04-Decision-Science/02-Statistical-Inference/01-Orders
plugins: dash-2.0.0, anyio-3.3.2
collecting ... collected 4 items

tests/test_reviews.py::TestReviews::test_dim_five_star PASSED            [ 25%]
tests/test_reviews.py::TestReviews::test_dim_not_five_star PASSED        [ 50%]
tests/test_reviews.py::TestReviews::test_dim_not_one_star PASSED         [ 75%]
tests/test_reviews.py::TestReviews::test_dim_one_star PASSED             [100%]



💯 You can commit your code:

git add tests/reviews.pickle

git commit -m 'Completed reviews step'

git push origin master


### `get_number_products`:
     Returns a DataFrame with:
        order_id, number_of_products (total number of products per order)

In [60]:
products = data['order_items'].copy()
products = products[['order_id', 'product_id']].groupby('order_id', as_index=False).count().rename(columns={'product_id':'number_of_products'}).sort_values('number_of_products')
products

Unnamed: 0,order_id,number_of_products
0,00010242fe8c5a6d1ba2dd792cb16214,1
64059,a6e9d106235bcf1dda54253686d89e99,1
64058,a6e9b80a7636eb8dd592dbb3e20d0a91,1
64057,a6e963c11e80432334e984ead4797a8b,1
64056,a6e8ad5db31e71f5f12671af561acb4a,1
...,...,...
25583,428a2f660dc84138d969ccd69a0ab6d5,15
60941,9ef13efd6949e4573a18964dd1bbe7f5,15
10459,1b15974a0141d54e36626dca3fdc731a,20
65715,ab14fdcfbe524636d65ee38360e22ce8,20


👉 Same routine: 
* check your dataframe, 
* commit your code to `olist/order.py`
* and check that it truly works.

In [61]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
Order().get_number_products()

Unnamed: 0,order_id,number_of_products
0,00010242fe8c5a6d1ba2dd792cb16214,1
64059,a6e9d106235bcf1dda54253686d89e99,1
64058,a6e9b80a7636eb8dd592dbb3e20d0a91,1
64057,a6e963c11e80432334e984ead4797a8b,1
64056,a6e8ad5db31e71f5f12671af561acb4a,1
...,...,...
25583,428a2f660dc84138d969ccd69a0ab6d5,15
60941,9ef13efd6949e4573a18964dd1bbe7f5,15
10459,1b15974a0141d54e36626dca3fdc731a,20
65715,ab14fdcfbe524636d65ee38360e22ce8,20


### `get_number_sellers`:
     Returns a DataFrame with:
        order_id, number_of_sellers (total number of unique sellers per order)
        
<details>
    <summary>- <i>Hint</i> -</summary>

`pd.Series.nunique()`
</details>
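
To illustrate the hint, here is a sketch on toy data (the `toy_items` DataFrame is invented for illustration) showing the difference between counting rows and counting distinct values within each group.

```python
import pandas as pd

# Toy data: order o1 has three items from two different sellers
toy_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o1", "o2"],
    "seller_id": ["s1", "s1", "s2", "s3"],
})

grouped = toy_items.groupby("order_id")["seller_id"]

# count() counts rows per order, duplicates included
n_rows = grouped.count()

# nunique() counts distinct sellers per order
n_sellers = grouped.nunique()
```

For order `o1`, `count()` gives 3 while `nunique()` gives 2, which is why the latter is the right tool for a "unique sellers" feature.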

In [67]:
sellers = data['order_items'].copy()
sellers = sellers[['order_id', 'seller_id']].groupby('order_id', as_index=False).nunique().rename(columns={'seller_id':'number_of_sellers'}).sort_values('number_of_sellers')
sellers

Unnamed: 0,order_id,number_of_sellers
0,00010242fe8c5a6d1ba2dd792cb16214,1
65559,aaaf314a8cf0d0da71e52c6cd4184cbd,1
65558,aaaea350ff8a957595f3c631d6b63d1b,1
65557,aaae80f5b6239bd9e1b22e9aa542c3e8,1
65556,aaabf43feb9498d9de4588eb73231c25,1
...,...,...
11231,1d23106803c48c391366ff224513fb7f,4
53796,8c2b13adf3f377c8f2b06b04321b0925,4
55847,91be51c856a90d7efe86cf9d082d6ae3,4
79967,cf5c8d9f52807cb2d2f0a0ff54c478da,5


In [None]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
Order().get_number_sellers()

### `get_price_and_freight`
     Returns a DataFrame with:
        order_id, price, freight_value

<details>
    <summary>- <i>Hint -</i></summary>

Calling `.agg()` on a groupby object (`DataFrameGroupBy.agg()`) allows you to apply one aggregation function per column
</details>
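
A minimal sketch of a per-column aggregation on toy data (the `toy_items` values are invented for illustration). Passing a dictionary to `.agg()` maps each column to its own aggregation function; using `"sum"` for both columns here is an assumption about what "total" means for this feature.

```python
import pandas as pd

# Toy data, not taken from the Olist dataset
toy_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "price": [10.0, 5.0, 20.0],
    "freight_value": [2.0, 1.0, 4.0],
})

# One aggregation function per column, keyed by column name
totals = (
    toy_items
    .groupby("order_id", as_index=False)
    .agg({"price": "sum", "freight_value": "sum"})
)
```

The dictionary could just as well map different functions to different columns, for example `{"price": "sum", "freight_value": "max"}`.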

In [None]:
# YOUR CODE HERE

In [None]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
Order().get_price_and_freight()

### `get_distance_seller_customer` 
**(OPTIONAL - Try to code this function only after finishing today's challenges - Skip to next section)**

    Returns a DataFrame with:
        [order_id, distance_seller_customer] 
               (the distance in km between customer and seller)

💡 Have a look at the `haversine_distance` formula we coded for you in the `olist.utils` module
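
For reference, the haversine formula itself can be sketched as below. This is a standalone illustration, not the implementation found in `olist.utils`: the function name `haversine_km`, its argument order and the Earth radius of 6371 km are assumptions made for this example.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lng1, lat2, lng2, radius_km=6371.0):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    # Convert degrees to radians before applying trigonometric functions
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    # Haversine of the central angle between the two points
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlng / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


# One degree of latitude is roughly 111 km
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
```

Check the signature of the helper in `olist.utils` before using it, since its parameter names and order may differ from this sketch.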

In [0]:
# YOUR CODE HERE

👀 Check your new dataframe and commit your code to olist/order.py when it works. 

In [None]:
# YOUR CODE HERE

# Test your newly coded module

❓ Time to code `get_training_data`, making use of your previously coded methods.
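
One way to combine per-feature DataFrames is to merge them successively on `order_id`. The sketch below uses three toy tables (invented for illustration) in place of the real method outputs; the choice of an inner join is an assumption, so check which orders you actually want to keep.

```python
import pandas as pd

# Toy stand-ins for the outputs of the feature methods
toy_wait = pd.DataFrame({"order_id": ["o1", "o2", "o3"], "wait_time": [8.4, 13.8, 9.4]})
toy_scores = pd.DataFrame({"order_id": ["o1", "o2"], "review_score": [4, 5]})
toy_products = pd.DataFrame({"order_id": ["o1", "o2", "o3"], "number_of_products": [1, 2, 1]})

# merge() defaults to an inner join: only orders present in every table survive
training = (
    toy_wait
    .merge(toy_scores, on="order_id")
    .merge(toy_products, on="order_id")
)
```

Order `o3` has no review in the toy data, so it is dropped by the inner join and `training` ends up with two rows and four columns.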

In [None]:
from olist.order import Order
from nbresult import ChallengeResult
data = Order().get_training_data()
result = ChallengeResult('training',
    shape=data.shape,
    columns=sorted(list(data.columns))
)
result.write()
print(result.check())

🏁 Congratulations! 

⌛️ Commit and push your notebook before starting the next challenge.