# Orders

✏️ **Exercise**

Today, we will investigate the **orders**, and their associated **review score**.

👉 Our goal is to create a DataFrame with the following features:


| feature_name 	| type 	| description 	|
|:---	|:---:	|:---	|
| `order_id` 	| str 	| the id of the order 	|
| `wait_time` 	| float 	| the number of days between order_date and delivered_date 	|
| `expected_wait_time` 	| float 	| the number of days between order_date and estimated_delivery_date 	|
| `delay_vs_expected` 	| float 	| the number of days between the estimated and actual delivery dates if the order was delivered later than estimated, otherwise 0 	|
| `order_status` 	| str 	| the status of the order 	|
| `dim_is_five_star` 	| int 	| 1 if the order received a five-star review, 0 otherwise 	|
| `dim_is_one_star` 	| int 	| 1 if the order received a one-star review, 0 otherwise 	|
| `review_score` 	| int 	| from 1 to 5 	|
| `number_of_products` 	| int 	| number of products that the order contains 	|
| `number_of_sellers` 	| int 	| number of sellers involved in the order 	|
| `price` 	| float 	| total price of the order paid by customer 	|
| `freight_value` 	| float 	| value of the freight paid by customer 	|
| `distance_seller_customer` 	| float 	| the distance in km between customer and seller (optional) 	|
  
⚠️ Unless explicitly specified otherwise, we also filter out "non-delivered" orders, because potential delays cannot be computed for them.
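
One possible way to express "filter by default, unless told otherwise" is a boolean flag with a default value. The sketch below runs on a toy DataFrame; the function name `filter_delivered` and the parameter name `is_delivered` are assumptions made for illustration and are not taken from the provided `Order` class.

```python
import pandas as pd


def filter_delivered(orders: pd.DataFrame, is_delivered: bool = True) -> pd.DataFrame:
    """Keep only delivered orders by default; keep every order if is_delivered is False."""
    if is_delivered:
        # .copy() gives an independent DataFrame that is safe to modify later
        return orders[orders["order_status"] == "delivered"].copy()
    return orders.copy()


# Toy data, not the real Olist orders table
toy_orders = pd.DataFrame({
    "order_id": ["a", "b", "c"],
    "order_status": ["delivered", "canceled", "delivered"],
})

delivered_only = filter_delivered(toy_orders)
all_orders = filter_delivered(toy_orders, is_delivered=False)
```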

❓ **Your challenge**: 

- Implement each feature as a separate method within the `Order` class available at `olist/order.py`
- Then, create a method `get_training_data()` that returns the complete DataFrame **without `NaN`s**.

💡 Suggested methodology:
- Use the notebook below to write and test your code step-by-step first
- Then copy the code into `order.py` once you are certain of your code logic
- Focus on the data manipulation logic for now; we will analyse the dataset visually in the next challenges

🔥 Notebook best practices (must-read) 👇

<details>
    <summary>▸ <i>click here</i></summary>

From now on, exploratory notebooks are going to become pretty long, and we strongly advise you to follow these notebook principles:
- Code your logic so that your Notebook can always be run from top to bottom without crashing (Cell --> Run All)
- Name your variables carefully 
- Use dummy names such as `tmp` or `_` for intermediary steps when you know you won't need them for long
- Clear your code and merge cells when relevant to minimize Notebook size (`Shift-M`)
- Hide your cell output if you don't need to see it anymore (double-click on the red `Out[]:` section to the left of your cell).
- Make heavy use of the Jupyter nbextensions `Collapsible Headings` and `Table of Contents` (call a TA if you can't find them)
- Use the following shortcuts 
    - `a` to insert a cell above
    - `b` to insert a cell below
    - `dd` to delete a cell
    - `esc` and `arrows` to move between cells
    - `Shift-Enter` to execute cell and move focus to the next one
    - use `Shift + Tab` when you are between method brackets e.g. `groupby()` to get the docs! Repeat a few times to open it permanently

</details>





In [1]:
# Auto reload imported module every time a jupyter cell is executed (handy for olist.order.py updates)
%load_ext autoreload
%autoreload 2


In [2]:
# Import usual modules
import os
import sys
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [4]:
# Import olist data
root_path = os.path.join(os.getcwd(),'..')
if root_path not in sys.path:
    sys.path.append(root_path)

from utils.data import Olist
olist = Olist()
data = olist.get_data()

In [5]:
data.keys()


dict_keys(['orders', 'customers', 'order_items', 'products', 'product_category_name_translation', 'sellers', 'order_payments', 'geolocation', 'order_reviews'])

In [28]:
orders = data['orders'].copy()
temp_orders = data['orders'].copy()

assert(orders.shape == (99441, 8))


## 1. Code `order.py`

### a) `get_wait_time`
    ❓ Return a DataFrame with:
           order_id, wait_time, expected_wait_time, delay_vs_expected, order_status


🎁 We give you the pseudo-code below 👇 for this first method:

> 1. Inspect the `orders` dataframe
2. Filter the dataframe on `delivered orders`
3. Handle `datetime`
    - Take time to understand what python [`datetime`](https://docs.python.org/3/library/datetime.html) objects are
    - Then convert the date columns from strings to pandas datetimes using [`pandas.to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)
4. Compute `wait_time`
5. Compute `expected_wait_time`
6. Compute `delay_vs_expected`
7. Check the new dataframe 
8. Once you are satisfied with your code, carefully copy it from the notebook to `olist/order.py`
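
As a minimal sketch of step 3, the snippet below converts string columns with `pandas.to_datetime()`. The toy DataFrame only reuses two column names from the `orders` table; its values are invented for illustration.

```python
import pandas as pd

# Toy data: dates stored as strings, with one missing delivery date
toy = pd.DataFrame({
    "order_purchase_timestamp": ["2018-01-01 10:00:00", "2018-01-05 08:30:00"],
    "order_delivered_customer_date": ["2018-01-04 22:00:00", None],
})

# Convert every column from strings to pandas datetimes
for col in toy.columns:
    toy[col] = pd.to_datetime(toy[col])

# The columns now hold datetimes; the missing value becomes NaT
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["order_purchase_timestamp"])
missing_delivery = toy["order_delivered_customer_date"].isna().sum()
```

Missing dates survive the conversion as `NaT`, which is one reason to filter on delivered orders before computing time differences.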

<details>
    <summary>💡Hint</summary>

For both `wait_time` and `delay_vs_expected`, you need to subtract the relevant dates/timestamps to get the time difference between the `pandas.datetime` objects. Then, you can either use [`datetime.timedelta()`](https://docs.python.org/3/library/datetime.html#timedelta-objects) or [`np.timedelta64()`](https://numpy.org/doc/stable/reference/arrays.datetime.html#datetime-and-timedelta-arithmetic) to find out how many days that subtraction represents!

</details>
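
To illustrate the hint, here is a small sketch (on invented timestamps) of two equivalent ways to turn a datetime subtraction into a number of days as a float.

```python
import numpy as np
import pandas as pd

# Invented timestamps for illustration
purchase = pd.to_datetime(pd.Series(["2018-01-01 00:00:00", "2018-01-01 12:00:00"]))
delivered = pd.to_datetime(pd.Series(["2018-01-03 12:00:00", "2018-01-02 00:00:00"]))

# Subtracting two datetime Series gives a timedelta Series
elapsed = delivered - purchase

# Option 1: divide by a one-day timedelta to get fractional days
wait_time_days = elapsed / np.timedelta64(1, "D")

# Option 2: go through seconds (86,400 seconds per day)
wait_time_days_alt = elapsed.dt.total_seconds() / 86_400
```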

In [29]:
# Filter the dataframe on delivered orders (.copy() avoids a SettingWithCopyWarning on the assignments below)
orders = orders.query("order_status == 'delivered'").copy()

# Convert dates from "string" type to "pandas.datetime"
to_date_time_col = ['order_purchase_timestamp',
                    'order_approved_at',
                    'order_delivered_carrier_date',
                    'order_delivered_customer_date',
                    'order_estimated_delivery_date']
for col in to_date_time_col:
    orders[col] = pd.to_datetime(orders[col])

# Compute wait_time in days (rounded to one decimal) and store it in a new column
orders['wait_time'] = orders['order_delivered_customer_date'] - orders['order_purchase_timestamp']
orders['wait_time'] = (orders['wait_time'] / pd.Timedelta(days=1)).round(1)

# Compute expected_wait_time in days (rounded to one decimal) and store it in a new column
orders['expected_wait_time'] = orders['order_estimated_delivery_date'] - orders['order_purchase_timestamp']
orders['expected_wait_time'] = (orders['expected_wait_time'] / pd.Timedelta(days=1)).round(1)

# Compute delay_vs_expected and store it in a new column
orders['delay_vs_expected'] = orders.apply(lambda x:
                                                  (x['wait_time'] - x['expected_wait_time'])
                                                  if x['wait_time'] > x['expected_wait_time'] else 0,
                                                  axis=1)

orders = orders[['order_id', 'order_status', 'wait_time', 'expected_wait_time', 'delay_vs_expected']]
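
The row-wise `apply` used for `delay_vs_expected` can also be written as a vectorized operation. The sketch below shows one such alternative on toy values: subtract the two columns, then clip negative results to 0.

```python
import pandas as pd

# Toy values: early, late, and exactly on time
toy = pd.DataFrame({
    "wait_time": [8.4, 20.0, 12.5],
    "expected_wait_time": [15.5, 15.0, 12.5],
})

# Negative differences (delivered early) are clipped to 0
toy["delay_vs_expected"] = (toy["wait_time"] - toy["expected_wait_time"]).clip(lower=0)
```

On a table of roughly 100,000 rows, a vectorized expression of this kind is typically much faster than `apply(..., axis=1)`.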


In [30]:
orders


Unnamed: 0,order_id,order_status,wait_time,expected_wait_time,delay_vs_expected
0,e481f51cbdc54678b7cc49136f2d6af7,delivered,8.4,15.5,0.0
1,53cdb2fc8bc7dce0b6741e2150273451,delivered,13.8,19.1,0.0
2,47770eb9100c2d0c44946d9cf07ec65d,delivered,9.4,26.6,0.0
3,949d5b44dbf5de918fe9c16f97b45f8a,delivered,13.2,26.2,0.0
4,ad21c59c0840e6cb83a9ceb5573f8159,delivered,2.9,12.1,0.0
...,...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,delivered,8.2,18.6,0.0
99437,63943bddc261676b46f01ca7ac2f7bd8,delivered,22.2,23.5,0.0
99438,83c1379a015df1e13d02aae0204711ab,delivered,24.9,30.4,0.0
99439,11c177c8e97725db2631073c19f07b62,delivered,17.1,37.1,0.0


👀 Check the dataframe you've just created. <br/> 

💪 When your code works, commit it to `olist/order.py` <br/>

🧪 Now, test it by running the following cell 👇 

In [8]:
# Test your code here
from olist.order import Order
Order().get_wait_time()


Unnamed: 0,order_id,order_status,wait_time,expected_wait_time,delay_vs_expected
0,e481f51cbdc54678b7cc49136f2d6af7,delivered,8.4,15.5,0.0
1,53cdb2fc8bc7dce0b6741e2150273451,delivered,13.8,19.1,0.0
2,47770eb9100c2d0c44946d9cf07ec65d,delivered,9.4,26.6,0.0
3,949d5b44dbf5de918fe9c16f97b45f8a,delivered,13.2,26.2,0.0
4,ad21c59c0840e6cb83a9ceb5573f8159,delivered,2.9,12.1,0.0
...,...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,delivered,8.2,18.6,0.0
99437,63943bddc261676b46f01ca7ac2f7bd8,delivered,22.2,23.5,0.0
99438,83c1379a015df1e13d02aae0204711ab,delivered,24.9,30.4,0.0
99439,11c177c8e97725db2631073c19f07b62,delivered,17.1,37.1,0.0


In [9]:
from nbresult import ChallengeResult
test = Order().get_wait_time()
result = ChallengeResult('wait_time', dve_type=test["delay_vs_expected"].dtype, shape=test.shape, dve_min=test["delay_vs_expected"].min(), dve_max=test["delay_vs_expected"].max())
result.write(); print(result.check())



platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/jarisfenner/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/jarisfenner/code/Kaaykun/04-Decision-Science/02-Statistical-Inference/data-orders/tests
plugins: typeguard-2.13.3, asyncio-0.19.0, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_wait_time.py::TestWaitTime::test_wait_time PASSED                   [100%]



💯 You can commit your code:

git add tests/wait_time.pickle

git commit -m 'Completed wait_time step'

git push origin master



### b) `get_review_score`
     ❓ Returns a DataFrame with:
        order_id, dim_is_five_star, dim_is_one_star, review_score

dim_is_$N$_star should contain `1` if review_score=$N$ and `0` otherwise 

<details>
    <summary markdown='span'>Hints</summary>

Think about `Series.map()` or `DataFrame.apply()`
    
</details>
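
As a sketch of the hint, the snippet below builds both dummy columns from a toy `review_score` Series, once with `Series.map()` and once with a boolean comparison cast to integers.

```python
import pandas as pd

# Toy review scores, not the real reviews table
scores = pd.Series([5, 4, 1, 5, 3], name="review_score")

# Series.map applies the function to each score
dim_is_five_star = scores.map(lambda score: 1 if score == 5 else 0)

# Equivalent vectorized form: compare, then cast True/False to 1/0
dim_is_one_star = (scores == 1).astype(int)
```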

👉 We load the `reviews` for you

In [10]:
reviews = data['order_reviews'].copy()
assert(reviews.shape == (99224,7))
reviews


Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53
...,...,...,...,...,...,...,...
99219,574ed12dd733e5fa530cfd4bbf39d7c9,2a8c23fee101d4d5662fa670396eb8da,5,,,2018-07-07 00:00:00,2018-07-14 17:18:30
99220,f3897127253a9592a73be9bdfdf4ed7a,22ec9f0669f784db00fa86d035cf8602,5,,,2017-12-09 00:00:00,2017-12-11 20:06:42
99221,b3de70c89b1510c4cd3d0649fd302472,55d4004744368f5571d1f590031933e4,5,,"Excelente mochila, entrega super rápida. Super...",2018-03-22 00:00:00,2018-03-23 09:10:43
99222,1adeb9d84d72fe4e337617733eb85149,7725825d039fc1f0ceb7635e3f7d9206,4,,,2018-07-01 00:00:00,2018-07-02 12:59:13


In [11]:
# Keep only the columns we need; .copy() avoids pandas' SettingWithCopyWarning on the assignments below
reviews = reviews[['order_id', 'review_score']].copy()
# Should contain 1 if review score is 5, else 0
reviews['dim_is_five_star'] = reviews['review_score'].map(lambda score: 1 if score == 5 else 0)
# Should contain 1 if review score is 1, else 0
reviews['dim_is_one_star'] = reviews['review_score'].map(lambda score: 1 if score == 1 else 0)
# return reviews


Once again, 

👀 Check the dataframe you've just created. <br/> 

💪 When your code works, commit it to `olist/order.py` <br/>

🧪 Now, test it by running the following cell 👇 

In [12]:
# Test your code here
from olist.order import Order
Order().get_review_score()


Unnamed: 0,order_id,review_score,dim_is_five_star,dim_is_one_star
0,73fc7af87114b39712e6da79b0a377eb,4,0,0
1,a548910a1c6147796b98fdf73dbeba33,5,1,0
2,f9e4b658b201a9f2ecdecbb34bed034b,5,1,0
3,658677c97b385a9be170737859d3511b,5,1,0
4,8e6bfb81e283fa7e4f11123a3fb894f1,5,1,0
...,...,...,...,...
99219,2a8c23fee101d4d5662fa670396eb8da,5,1,0
99220,22ec9f0669f784db00fa86d035cf8602,5,1,0
99221,55d4004744368f5571d1f590031933e4,5,1,0
99222,7725825d039fc1f0ceb7635e3f7d9206,4,0,0


In [13]:
from nbresult import ChallengeResult
result = ChallengeResult('review_score', shape=Order().get_review_score().shape)
result.write(); print(result.check())



platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/jarisfenner/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/jarisfenner/code/Kaaykun/04-Decision-Science/02-Statistical-Inference/data-orders/tests
plugins: typeguard-2.13.3, asyncio-0.19.0, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_review_score.py::TestReviewScore::test_review_score PASSED          [100%]



💯 You can commit your code:

git add tests/review_score.pickle

git commit -m 'Completed review_score step'

git push origin master



### c) `get_number_products`:
     ❓ Returns a DataFrame with:
        order_id, number_of_products (total number of products per order)

In [14]:
temp = data['products'].copy()
temp


Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumaria,40.0,287.0,1.0,225.0,16.0,10.0,14.0
1,3aa071139cb16b67ca9e5dea641aaa2f,artes,44.0,276.0,1.0,1000.0,30.0,18.0,20.0
2,96bd76ec8810374ed1b65e291975717f,esporte_lazer,46.0,250.0,1.0,154.0,18.0,9.0,15.0
3,cef67bcfe19066a932b7673e239eb23d,bebes,27.0,261.0,1.0,371.0,26.0,4.0,26.0
4,9dc1a7de274444849c219cff195d0b71,utilidades_domesticas,37.0,402.0,4.0,625.0,20.0,17.0,13.0
...,...,...,...,...,...,...,...,...,...
32946,a0b7d5a992ccda646f2d34e418fff5a0,moveis_decoracao,45.0,67.0,2.0,12300.0,40.0,40.0,40.0
32947,bf4538d88321d0fd4412a93c974510e6,construcao_ferramentas_iluminacao,41.0,971.0,1.0,1700.0,16.0,19.0,16.0
32948,9a7c6041fa9592d9d9ef6cfe62a71f8c,cama_mesa_banho,50.0,799.0,1.0,1400.0,27.0,7.0,27.0
32949,83808703fc0706a22e264b9d75f04a2e,informatica_acessorios,60.0,156.0,2.0,700.0,31.0,13.0,20.0




In [15]:
# order_items has one row per item sold, so counting rows per order gives the number of products
products = data['order_items'].groupby(by='order_id')[['product_id']].count()
products = products.rename(columns={'product_id': 'number_of_products'}).reset_index()
products


Unnamed: 0,order_id,number_of_products
0,00010242fe8c5a6d1ba2dd792cb16214,1
1,00018f77f2f0320c557190d7a144bdd3,1
2,000229ec398224ef6ca0657da4fc703e,1
3,00024acbcdf0a6daa1e931b038114c75,1
4,00042b26cf59d7ce69dfabb4e55b4fd9,1
...,...,...
98661,fffc94f6ce00a00581880bf54a75a037,1
98662,fffcd46ef2263f404302a634eb57f7eb,1
98663,fffce4705a9662cd70adb13d4a31832d,1
98664,fffe18544ffabc95dfada21779c9644f,1


🧪 Same routine: 
* check your dataframe, 
* commit your code to `olist/order.py`
* and check that it truly works.

In [16]:
from nbresult import ChallengeResult
result = ChallengeResult('number_products', shape=Order().get_number_products().shape)
result.write(); print(result.check())



platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/jarisfenner/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/jarisfenner/code/Kaaykun/04-Decision-Science/02-Statistical-Inference/data-orders/tests
plugins: typeguard-2.13.3, asyncio-0.19.0, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_number_products.py::TestNumberProducts::test_review_score PASSED    [100%]



💯 You can commit your code:

git add tests/number_products.pickle

git commit -m 'Completed number_products step'

git push origin master



### d) `get_number_sellers`:
     ❓ Returns a DataFrame with:
        order_id, number_of_sellers (total number of unique sellers per order)
        
<details>
    <summary>▸ <i>Hint</i></summary>

`pd.Series.nunique()`
</details>
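
The difference between counting rows and counting unique values is what separates `number_of_products` from `number_of_sellers`. The sketch below shows both on a toy `order_items`-like table (values invented for illustration).

```python
import pandas as pd

# Toy table: order o1 has 3 items sold by 2 distinct sellers
toy_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o1", "o2"],
    "product_id": ["p1", "p1", "p2", "p3"],
    "seller_id": ["s1", "s1", "s2", "s1"],
})

per_order = toy_items.groupby("order_id").agg(
    number_of_products=("product_id", "count"),   # every item counts
    number_of_sellers=("seller_id", "nunique"),   # each seller counts once
).reset_index()
```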

In [17]:
# order_items already links each order to its sellers, so count the distinct sellers per order
sellers = data['order_items'].groupby(by='order_id')[['seller_id']].nunique()
sellers = sellers.rename(columns={'seller_id': 'number_of_sellers'}).reset_index()

sellers


Unnamed: 0,order_id,number_of_sellers
0,00010242fe8c5a6d1ba2dd792cb16214,1
1,00018f77f2f0320c557190d7a144bdd3,1
2,000229ec398224ef6ca0657da4fc703e,1
3,00024acbcdf0a6daa1e931b038114c75,1
4,00042b26cf59d7ce69dfabb4e55b4fd9,1
...,...,...
98661,fffc94f6ce00a00581880bf54a75a037,1
98662,fffcd46ef2263f404302a634eb57f7eb,1
98663,fffce4705a9662cd70adb13d4a31832d,1
98664,fffe18544ffabc95dfada21779c9644f,1


In [18]:
from nbresult import ChallengeResult
result = ChallengeResult('number_sellers', shape=Order().get_number_sellers().shape)
result.write(); print(result.check())



platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/jarisfenner/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/jarisfenner/code/Kaaykun/04-Decision-Science/02-Statistical-Inference/data-orders/tests
plugins: typeguard-2.13.3, asyncio-0.19.0, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_number_sellers.py::TestNumberSellers::test_number_seller PASSED     [100%]



💯 You can commit your code:

git add tests/number_sellers.pickle

git commit -m 'Completed number_sellers step'

git push origin master



### e) `get_price_and_freight`
     Returns a DataFrame with:
        order_id, price, freight_value

<details>
    <summary>▸ <i>Hint</i></summary>

`DataFrameGroupBy.agg()` lets you apply a separate aggregation function to each column of your groupby object
</details>
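
As a sketch of the hint, `agg()` on a groupby object accepts a dictionary mapping each column to its aggregation function. The toy values below are invented for illustration.

```python
import pandas as pd

# Toy table: order o1 contains two items
toy_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "price": [10.0, 5.5, 20.0],
    "freight_value": [2.0, 1.0, 4.5],
})

# One aggregation per column; here both are sums, but they could differ
totals = (
    toy_items.groupby("order_id", as_index=False)
    .agg({"price": "sum", "freight_value": "sum"})
)
```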

In [19]:
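# YOUR CODE HERE
prices = data['order_items'][['order_id', 'price', 'freight_value']]
# Sum price and freight over all items of each order ('sum' as a string avoids the warning raised for the builtin sum)
prices = prices.groupby(by='order_id').agg('sum').reset_index()
prices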


Unnamed: 0,order_id,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,58.90,13.29
1,00018f77f2f0320c557190d7a144bdd3,239.90,19.93
2,000229ec398224ef6ca0657da4fc703e,199.00,17.87
3,00024acbcdf0a6daa1e931b038114c75,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,199.90,18.14
...,...,...,...
98661,fffc94f6ce00a00581880bf54a75a037,299.99,43.41
98662,fffcd46ef2263f404302a634eb57f7eb,350.00,36.53
98663,fffce4705a9662cd70adb13d4a31832d,99.90,16.95
98664,fffe18544ffabc95dfada21779c9644f,55.99,8.72


In [20]:
from nbresult import ChallengeResult
result = ChallengeResult('price', shape=Order().get_price_and_freight().shape)
result.write(); print(result.check())



platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/jarisfenner/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/jarisfenner/code/Kaaykun/04-Decision-Science/02-Statistical-Inference/data-orders/tests
plugins: typeguard-2.13.3, asyncio-0.19.0, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_price.py::TestPrice::test_price PASSED                              [100%]



💯 You can commit your code:

git add tests/price.pickle

git commit -m 'Completed price step'

git push origin master



### f) [OPTIONAL] `get_distance_seller_customer`
**(Try to code this function only after finishing today's challenges; for now, skip to the next section)**

    ❓ Returns a DataFrame with:
        order_id, distance_seller_customer (the distance in km between customer and seller)

💡Have a look at the `haversine_distance` formula we coded for you in the `olist.utils` module
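
For reference, here is a minimal, self-contained sketch of the haversine formula. It is not the course's `haversine_distance` function, whose signature is not shown here; the name `haversine_km`, its argument order, and the Earth radius constant are assumptions for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (latitude, longitude) points given in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Sanity checks: identical points, and a quarter of the equator
same_point = haversine_km(-23.55, -46.63, -23.55, -46.63)
quarter_equator = haversine_km(0.0, 0.0, 0.0, 90.0)
```

A quarter of the equator should come out close to 10,008 km, which is a quick way to check the formula before applying it to customer and seller coordinates.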

In [21]:
# data['customers'].columns


In [22]:
# YOUR CODE HERE
# distance = data['orders'].merge(data['customers'],
#                                 how='inner',
#                                 on='customer_id') \
#                          .merge(data['geolocation'],
#                                 how='inner',
#                                 left_on='customer_zip_code_prefix',
#                                 right_on='geolocation_zip_code_prefix') \
#                          .merge(data['sellers'],
#                                 how='inner',
#                                 left_on='geolocation_zip_code_prefix',
#                                 right_on='seller_zip_code_prefix')
# distance = distance[['order_id', 'geolocation_lat', 'geolocation_lng']]

# distance.head()


In [23]:
# # distance.shape
# temp = data['geolocation'].merge(data['customers'], how='inner', left_on='geolocation_zip_code_prefix', right_on='customer_zip_code_prefix')
# # temp = temp.merge(data['orders'], how='right', on='customer_id')
# temp


👀 Check your new dataframe and commit your code to `olist/order.py` when it works.

In [24]:
# YOUR CODE HERE


🧪  Test your code

In [25]:
from nbresult import ChallengeResult

result = ChallengeResult('distance',
    mean = Order().get_distance_seller_customer()['distance_seller_customer'].mean())
result.write()
print(result.check())



platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/jarisfenner/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/jarisfenner/code/Kaaykun/04-Decision-Science/02-Statistical-Inference/data-orders/tests
plugins: typeguard-2.13.3, asyncio-0.19.0, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_distance.py::TestDistance::test_distance PASSED                     [100%]



💯 You can commit your code:

git add tests/distance.pickle

git commit -m 'Completed distance step'

git push origin master



## 2. All at once: `get_training_data`

❓ Time to code `get_training_data`, using the methods you coded above to gather all order features in one table

In [26]:
# Columns and row counts of the intermediate DataFrames:
# orders   = ['order_id', 'order_status', 'wait_time', 'expected_wait_time', 'delay_vs_expected']  (96478 rows)
# reviews  = ['order_id', 'review_score', 'dim_is_five_star', 'dim_is_one_star']  (99224 rows)
# products = ['order_id', 'number_of_products']  (98666 rows)
# sellers  = ['order_id', 'number_of_sellers']  (98666 rows)
# prices   = ['order_id', 'price', 'freight_value']  (98666 rows)
# Merge all the data frames on order id
training_data_df = orders.merge(reviews, on='order_id', how='inner') \
                         .merge(products, on='order_id', how='inner') \
                         .merge(sellers, on='order_id', how='inner') \
                         .merge(prices, on='order_id', how='inner')
# Rearrange the columns
training_data_df = training_data_df[['order_id', 'wait_time', 'expected_wait_time',
                                     'delay_vs_expected', 'order_status', 'dim_is_five_star',
                                     'dim_is_one_star', 'review_score', 'number_of_products',
                                     'number_of_sellers', 'price', 'freight_value']]
# training_data_df.describe()
training_data_df = training_data_df.dropna()
training_data_df


Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status,dim_is_five_star,dim_is_one_star,review_score,number_of_products,number_of_sellers,price,freight_value
0,e481f51cbdc54678b7cc49136f2d6af7,8.4,15.5,0.0,delivered,0,0,4,1,1,29.99,8.72
1,53cdb2fc8bc7dce0b6741e2150273451,13.8,19.1,0.0,delivered,0,0,4,1,1,118.70,22.76
2,47770eb9100c2d0c44946d9cf07ec65d,9.4,26.6,0.0,delivered,1,0,5,1,1,159.90,19.22
3,949d5b44dbf5de918fe9c16f97b45f8a,13.2,26.2,0.0,delivered,1,0,5,1,1,45.00,27.20
4,ad21c59c0840e6cb83a9ceb5573f8159,2.9,12.1,0.0,delivered,1,0,5,1,1,19.90,8.72
...,...,...,...,...,...,...,...,...,...,...,...,...
96356,9c5dedf39a927c1b2549525ed64a053c,8.2,18.6,0.0,delivered,1,0,5,1,1,72.00,13.08
96357,63943bddc261676b46f01ca7ac2f7bd8,22.2,23.5,0.0,delivered,0,0,4,1,1,174.90,20.10
96358,83c1379a015df1e13d02aae0204711ab,24.9,30.4,0.0,delivered,1,0,5,1,1,205.99,65.02
96359,11c177c8e97725db2631073c19f07b62,17.1,37.1,0.0,delivered,0,0,2,2,1,359.98,81.18
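
The chained inner merges keep only the orders present in every feature table, and `dropna()` then removes rows with missing values. The sketch below reproduces that behaviour on three toy tables, using `functools.reduce` as one possible way to merge a list of DataFrames; the table contents are invented for illustration.

```python
from functools import reduce

import pandas as pd

# Toy feature tables sharing an order_id key
wait = pd.DataFrame({"order_id": ["o1", "o2", "o3"], "wait_time": [8.4, 13.8, 9.4]})
review = pd.DataFrame({"order_id": ["o1", "o2", "o4"], "review_score": [4, 5, 1]})
price = pd.DataFrame({"order_id": ["o1", "o2", "o3"], "price": [29.99, None, 159.90]})

# Inner-merge all tables on order_id, one after the other
frames = [wait, review, price]
merged = reduce(
    lambda left, right: left.merge(right, on="order_id", how="inner"),
    frames,
)

# o3 and o4 are missing from at least one table; o2 has a NaN price
training = merged.dropna()
```

Here `merged` keeps `o1` and `o2`, and `training` keeps only `o1`, which mirrors why the final table has fewer rows than any of its inputs.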


🧪  Test it below

In [27]:
from nbresult import ChallengeResult
from olist.order import Order
data = Order().get_training_data()

result = ChallengeResult('training',
    shape=data.shape,
    columns=sorted(list(data.columns))
)
result.write()
print(result.check())



platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/jarisfenner/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/jarisfenner/code/Kaaykun/04-Decision-Science/02-Statistical-Inference/data-orders/tests
plugins: typeguard-2.13.3, asyncio-0.19.0, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 2 items

test_training.py::TestTraining::test_training_data_columns PASSED        [ 50%]
test_training.py::TestTraining::test_training_data_shape PASSED          [100%]



💯 You can commit your code:

git add tests/training.pickle

git commit -m 'Completed training step'

git push origin master



🏁 Congratulations! 

💾 Commit and push your notebook before starting the next challenge.