## Initial research.

**Task:** Conduct a preliminary study of the data and formulate what should be considered a _purchase_.
You can justify your choice with the help of payment facts, order statuses, and other available data.

In [4]:
import pandas as pd



### 1. Let's explore our dataframes

**Data Understanding**
- Dataframe shapes
- Dataframe preview
- Datatypes
- Unique values


#### Customers dataset:

In [5]:
customers = pd.read_csv('olist_customers_dataset.csv')
customers.head()

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP
3,b2b6027bc5c5109e529d4dc6358b12c3,259dac757896d24d7702b9acbbff3f3c,8775,mogi das cruzes,SP
4,4f2d8ab171c80ec8364f7c12e35b23ad,345ecd01c38d18a9036ed96c73b8d066,13056,campinas,SP


In [6]:
customers.dtypes

customer_id                 object
customer_unique_id          object
customer_zip_code_prefix     int64
customer_city               object
customer_state              object
dtype: object

In [7]:
customers.shape

(99441, 5)

In [8]:
customers.nunique() 

customer_id                 99441
customer_unique_id          96096
customer_zip_code_prefix    14994
customer_city                4119
customer_state                 27
dtype: int64

**Observations:**
We have 96,096 unique `customer_unique_id` values, 3,345 fewer than the 99,441 unique `customer_id` values, which match the number of rows in the dataframe.
All customers have their address information.

Let's confirm that `customer_unique_id` can be listed multiple times.

In [9]:
customers.customer_unique_id.value_counts().head()

customer_unique_id
8d50f5eadf50201ccdcedfb9e2ac8455    17
3e43e6105506432c953e165fb2acf44c     9
1b6c7548a2a1f9037c1fd3ddfed95f33     7
ca77025e7201e3b30c44b472ff346268     7
6469f99c1f9dfae7733b25662e7f1782     7
Name: count, dtype: int64

**Result**:

Some users appear under several `customer_id` values, possibly because they have multiple records or accounts. As these are the same users, we will use `customer_unique_id` as the primary grouping key to get correct per-user data.
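As a rough illustration of grouping by `customer_unique_id`, the sketch below counts orders per user on a tiny hand-made example. The frames `customers_demo` and `orders_demo` and all IDs in them are hypothetical stand-ins for the real tables; only the column names come from the dataset.

```python
import pandas as pd

# Toy data with hypothetical IDs; column names mirror the Olist tables.
customers_demo = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3", "c4"],
    "customer_unique_id": ["u1", "u1", "u2", "u3"],
})
orders_demo = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "customer_id": ["c1", "c2", "c3", "c4"],
})

# Attach the stable user key to every order, then count orders per user.
orders_per_user = (
    orders_demo.merge(customers_demo, on="customer_id", how="left")
    .groupby("customer_unique_id")["order_id"]
    .nunique()
    .sort_values(ascending=False)
)
print(orders_per_user)
```

Counting by `customer_id` instead would report one order for every row here, which is why the per-user key matters.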

---

### Orders dataset:

In [10]:
orders = pd.read_csv('olist_orders_dataset.csv')
orders.dtypes

order_id                         object
customer_id                      object
order_status                     object
order_purchase_timestamp         object
order_approved_at                object
order_delivered_carrier_date     object
order_delivered_customer_date    object
order_estimated_delivery_date    object
dtype: object

All date columns were read as `object` (strings). Let's change the import so that they are parsed as proper datetimes.

In [11]:
orders = pd.read_csv('olist_orders_dataset.csv', 
                     parse_dates=['order_purchase_timestamp', 'order_approved_at', 
                                  'order_delivered_carrier_date', 'order_delivered_customer_date',
                                 'order_estimated_delivery_date'])
orders.head()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26


In [12]:
orders.dtypes

order_id                                 object
customer_id                              object
order_status                             object
order_purchase_timestamp         datetime64[ns]
order_approved_at                datetime64[ns]
order_delivered_carrier_date     datetime64[ns]
order_delivered_customer_date    datetime64[ns]
order_estimated_delivery_date    datetime64[ns]
dtype: object

In [13]:
orders.shape

(99441, 8)

In [14]:
orders.nunique()

order_id                         99441
customer_id                      99441
order_status                         8
order_purchase_timestamp         98875
order_approved_at                90733
order_delivered_carrier_date     81018
order_delivered_customer_date    95664
order_estimated_delivery_date      459
dtype: int64

**Observations:**
The number of unique `order_id` values (99,441) equals the number of unique `customer_id` values and the number of rows.

**Result**:
Each row is a single order, and each order has its own one-time `customer_id`.
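The one-row-per-order observation can also be checked explicitly rather than read off `nunique()`. A minimal sketch on hypothetical toy frames (`orders_demo`, `customers_demo`), using `Series.is_unique` and the `validate` argument of `DataFrame.merge`:

```python
import pandas as pd

# Toy data with hypothetical IDs.
orders_demo = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c3"],
})
customers_demo = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_unique_id": ["u1", "u1", "u2"],
})

# Both keys should be unique if every row is one order with its own customer_id.
ids_are_unique = (
    orders_demo["order_id"].is_unique and orders_demo["customer_id"].is_unique
)

# validate="one_to_one" makes pandas raise pandas.errors.MergeError
# if customer_id repeats on either side of the join.
merged = orders_demo.merge(customers_demo, on="customer_id", validate="one_to_one")
print(ids_are_unique, len(merged))
```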

---

Let's explore the time ranges we have in our dataset:

In [19]:
orders[["order_purchase_timestamp", "order_approved_at", "order_delivered_carrier_date", 
    "order_delivered_customer_date", "order_estimated_delivery_date"]].agg(['min', 'max'])

Unnamed: 0,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
min,2016-09-04 21:15:19,2016-09-15 12:16:38,2016-10-08 10:34:01,2016-10-11 13:46:32,2016-09-30
max,2018-10-17 17:30:18,2018-09-03 17:40:06,2018-09-11 19:48:28,2018-10-17 13:22:46,2018-11-12


Thus, we have data from *2016-09-04* to *2018-10-17*, but:
- `order_estimated_delivery_date` extends beyond this range, into *November 2018*.
- This maximum date belongs to a canceled order.

If we exclude canceled orders, we'll get the following:

In [20]:
not_canceled = (orders.loc[orders.order_status != 'canceled']
                .sort_values('order_estimated_delivery_date', ascending=False))
pd.concat([not_canceled.head(1), not_canceled.tail(1)])

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
82521,0ba4f7e89d304a02edca85087abfd14c,b5c1c56fe1ec3cec6893b98d90a339bd,delivered,2018-08-13 22:51:49,2018-08-13 23:04:09,2018-08-22 14:04:00,2018-08-30 23:38:26,2018-10-25
30710,bfbd0f9bdef84302105ad712db648a6c,86dc2ffce2dfff336de2f386a786e574,delivered,2016-09-15 12:16:38,2016-09-15 12:16:38,2016-11-07 17:11:53,2016-11-09 07:47:38,2016-10-04


The latest `order_estimated_delivery_date` is now *2018-10-25*, for an order that was already delivered in August.
The last actual delivery is on *2018-10-17*, the same day as the last order, so:

**Result**
- We can consider dates from *2016-09-04* to *2018-10-17* as an actual time period of the data we have 

---


### Order status:

In [9]:
orders.order_status.value_counts()

delivered      96478
shipped         1107
canceled         625
unavailable      609
invoiced         314
processing       301
created            5
approved           2
Name: order_status, dtype: int64

Let's divide the statuses into groups by how they can be related to the purchase:

| Definitely a purchase | Definitely NOT a purchase | Need to be checked | Unknown     |
| --------------------- | ------------------------- | ------------------ | ----------- |
| delivered             | canceled                  | shipped            | created     |
|                       |                           | invoiced           | approved    |
|                       |                           | processing         | unavailable |


#### Let's check `approved`, `created`, and `unavailable` statuses first:

In [15]:
orders.loc[orders.order_status == 'approved']

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
44897,a2e4c44360b4a57bdff22f3a4630c173,8886130db0ea6e9e70ba0b03d7c0d286,approved,2017-02-06 20:18:17,2017-02-06 20:30:19,NaT,NaT,2017-03-01
88457,132f1e724165a07f6362532bfb97486e,b2191912d8ad6eac2e4dc3b6e1459515,approved,2017-04-25 01:25:34,2017-04-30 20:32:41,NaT,NaT,2017-05-22


We have only two orders that were approved in 2017 but never delivered, thus:

- *`approved` status is not considered as a purchase*

In [12]:
orders.loc[orders.order_status == 'created']

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
7434,b5359909123fa03c50bdb0cfed07f098,438449d4af8980d107bf04571413a8e7,created,2017-12-05 01:07:52,NaT,NaT,NaT,2018-01-11
9238,dba5062fbda3af4fb6c33b1e040ca38f,964a6df3d9bdf60fe3e7b8bb69ed893a,created,2018-02-09 17:21:04,NaT,NaT,NaT,2018-03-07
21441,7a4df5d8cff4090e541401a20a22bb80,725e9c75605414b21fd8c8d5a1c2f1d6,created,2017-11-25 11:10:33,NaT,NaT,NaT,2017-12-12
55086,35de4050331c6c644cddc86f4f2d0d64,4ee64f4bfc542546f422da0aeb462853,created,2017-12-05 01:07:58,NaT,NaT,NaT,2018-01-08
58958,90ab3e7d52544ec7bc3363c82689965f,7d61b9f4f216052ba664f22e9c504ef1,created,2017-11-06 13:12:34,NaT,NaT,NaT,2017-12-01


We have five orders created between November 2017 and February 2018 that were never approved or delivered, thus:

- *`created` status is not considered as a purchase*

In [21]:
orders.loc[orders.order_status == 'unavailable'].shape

(609, 8)

In [22]:
orders.loc[orders.order_status == 'unavailable'].head()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
266,8e24261a7e58791d10cb1bf9da94df5c,64a254d30eed42cd0e6c36dddb88adf0,unavailable,2017-11-16 15:09:28,2017-11-16 15:26:57,NaT,NaT,2017-12-05
586,c272bcd21c287498b4883c7512019702,9582c5bbecc65eb568e2c1d839b5cba1,unavailable,2018-01-31 11:31:37,2018-01-31 14:23:50,NaT,NaT,2018-02-16
687,37553832a3a89c9b2db59701c357ca67,7607cd563696c27ede287e515812d528,unavailable,2017-08-14 17:38:02,2017-08-17 00:15:18,NaT,NaT,2017-09-05
737,d57e15fb07fd180f06ab3926b39edcd2,470b93b3f1cde85550fc74cd3a476c78,unavailable,2018-01-08 19:39:03,2018-01-09 07:26:08,NaT,NaT,2018-02-06
1160,2f634e2cebf8c0283e7ef0989f77d217,7353b0fb8e8d9675e3a704c60ca44ebe,unavailable,2017-09-27 20:55:33,2017-09-28 01:32:50,NaT,NaT,2017-10-27


There are many `unavailable` orders. Let's drop the rows that contain missing values to see whether any of these orders were delivered.

In [23]:
orders.loc[orders.order_status == 'unavailable'].dropna().size

0

None of the `unavailable` orders were delivered.

- *`unavailable` status is not considered as a purchase*
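Since `dropna()` drops a row when *any* column is missing, a more targeted check is to count non-null values in the delivery column directly. The sketch below does that on a hypothetical two-row frame (`unavailable_demo`); the timestamps are taken from the preview above, the order IDs are placeholders.

```python
import pandas as pd

# Toy frame shaped like the `unavailable` subset (placeholder order IDs).
unavailable_demo = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "order_status": ["unavailable", "unavailable"],
    "order_approved_at": pd.to_datetime(
        ["2017-11-16 15:26:57", "2018-01-31 14:23:50"]
    ),
    "order_delivered_carrier_date": pd.to_datetime([None, None]),
    "order_delivered_customer_date": pd.to_datetime([None, None]),
})

# Number of orders that actually have a customer delivery timestamp.
delivered_count = unavailable_demo["order_delivered_customer_date"].notna().sum()

# Non-null counts for every column, to see which stages were ever reached.
filled_per_column = unavailable_demo.notna().sum()
print(delivered_count)
print(filled_per_column)
```

This variant answers the question column by column, so an order missing only an unrelated field would not be discarded by accident.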

#### Let's check `shipped`, `invoiced`, and `processing` statuses:

In [24]:
shipped = orders.loc[orders.order_status == 'shipped'].sort_values('order_delivered_carrier_date', ascending=False)
shipped.head()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
62360,54282e97f61c23b78330c15b154c867d,4b7decb9b58e2569548b8b4c8e20e8d7,shipped,2018-09-03 09:06:57,2018-09-03 17:40:06,2018-09-04 15:25:00,NaT,2018-09-06
22575,6ca46f2b9a1592929647682510800e0e,13bf775a749925a15ef7cc1985b564f1,shipped,2018-08-24 17:02:19,2018-08-24 17:15:10,2018-08-27 15:15:00,NaT,2018-08-29
29651,f73b31435ce6dec43df056154c39a1ce,f0f671d4034e98cdf20f0c452d6db02b,shipped,2018-08-20 12:37:54,2018-08-20 15:35:42,2018-08-24 16:48:00,NaT,2018-08-27
38209,2b59ddd3b4175dd13033a71b56785a33,a192e83217022c634d89c73d018d8251,shipped,2018-08-18 11:59:18,2018-08-20 11:33:14,2018-08-23 14:57:00,NaT,2018-08-28
24944,99b3fb1a943fa5d4af2a3386f00fdd19,aa03e52d50af7237a5963ffb09dd8872,shipped,2018-08-22 09:01:17,2018-08-22 09:10:17,2018-08-23 14:09:00,NaT,2018-08-30


In [25]:
shipped.shape

(1107, 8)

In [27]:
pd.concat([shipped.head(1), shipped.tail(1)])

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
62360,54282e97f61c23b78330c15b154c867d,4b7decb9b58e2569548b8b4c8e20e8d7,shipped,2018-09-03 09:06:57,2018-09-03 17:40:06,2018-09-04 15:25:00,NaT,2018-09-06
32371,0efd0bc268d34da3f01f4ff25a6d4335,95446917717bb58d553d107d0f1668f6,shipped,2016-10-07 15:53:31,2016-10-07 16:12:22,2016-10-11 16:12:23,NaT,2016-12-13


Carrier hand-off dates range from 2016-10-11 to 2018-09-04. The most recent of these orders was placed on 2018-09-03, one day before it was handed to the carrier.

**Interim conclusion**
- The delivery dates for these orders may simply be missing from the data we received from the logistics company. Therefore, we can treat these orders as delivered and count them as purchases.

#### Invoiced status

Let's verify if the `invoiced` status only contains the latest paid orders pending delivery and if there are no other issues with the status.

In [30]:
invoiced = orders.loc[orders.order_status == 'invoiced'].sort_values('order_purchase_timestamp', ascending=False)
invoiced.head()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
56185,c3fd18a4aeb8e2d589b0248ce60e91b7,17c867817e32d1474241a11bb120d4c4,invoiced,2018-08-14 18:45:08,2018-08-14 19:05:23,NaT,NaT,2018-08-27
82498,67d06af97696c95df2bb26acf82db637,d3ddbbd220a088a3e353cd270c184e60,invoiced,2018-08-09 17:18:57,2018-08-09 17:50:12,NaT,NaT,2018-08-17
35918,a03ac1d748cc1533566bf2c8199159d4,d89f76444e908c953bd1f8390aec9a00,invoiced,2018-08-09 06:43:29,2018-08-09 06:55:06,NaT,NaT,2018-08-24
82392,2da6df6ecc3f69f3642ce1fafad85d5a,d04b499809157673780addf5719f7af1,invoiced,2018-08-07 22:22:53,2018-08-07 22:35:23,NaT,NaT,2018-08-15
70615,b81dc96814c73196e795524a469a2c35,ed3f4ba1e8981bd3773207b41a9de82b,invoiced,2018-08-07 19:13:51,2018-08-07 19:25:15,NaT,NaT,2018-08-21


In [32]:
invoiced.shape

(314, 8)

In [33]:
pd.concat([invoiced.head(1), invoiced.tail(1)])

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
56185,c3fd18a4aeb8e2d589b0248ce60e91b7,17c867817e32d1474241a11bb120d4c4,invoiced,2018-08-14 18:45:08,2018-08-14 19:05:23,NaT,NaT,2018-08-27
78824,dd359d3c294458c6d642b2eea9212bf5,5c58d1ea5a893380ecdd96dd6dfd5ec5,invoiced,2016-10-04 13:02:10,2016-10-05 03:08:27,NaT,NaT,2016-11-24


Purchase dates for these orders range from 2016-10-04 to 2018-08-14. None of these orders were forwarded for delivery, and the most recent one was placed about two months before the last date in our dataframe. Therefore, the `invoiced` status does **not** simply mark the latest orders awaiting delivery.
- It's highly likely that the data regarding these orders has been lost.

#### `Processing` status:

In [34]:
processing = orders.loc[orders.order_status == 'processing'].sort_values('order_purchase_timestamp', ascending=False)
processing.head()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
61386,8bc9548dbe844e1bf48ac197c5609045,d4dc57fd18dfe3e30be4d066d873d388,processing,2018-07-23 18:03:03,2018-07-24 10:32:19,NaT,NaT,2018-08-13
28651,83fc33b62b8c7c39e2258d081955143d,c6eb22e4e7b7e5b871c06250c3296ef9,processing,2018-05-15 20:00:48,2018-05-17 02:55:42,NaT,NaT,2018-05-30
20012,8124e0a6295df5f9ce4377ca0a8e0c18,6e14fc1f239d4384a91b043639b2e3b7,processing,2018-05-08 15:27:57,2018-05-09 17:36:39,NaT,NaT,2018-06-05
15581,eb3c78fe8b35f52d4369d69d383fc212,263f9cd6ec31b2a2f80e4e26cb235335,processing,2018-05-07 22:40:38,2018-05-07 22:52:41,NaT,NaT,2018-06-04
20069,57c432fd96e9d52639f048eb1b2a5b10,fdd0411bec46e2d9a15590016b897dc3,processing,2018-05-07 19:30:29,2018-05-09 04:31:31,NaT,NaT,2018-06-04


In [35]:
processing.shape

(301, 8)

In [36]:
pd.concat([processing.head(1), processing.tail(1)])

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
61386,8bc9548dbe844e1bf48ac197c5609045,d4dc57fd18dfe3e30be4d066d873d388,processing,2018-07-23 18:03:03,2018-07-24 10:32:19,NaT,NaT,2018-08-13
324,d3c8851a6651eeff2f73b0e011ac45d0,957f8e082185574de25992dc659ebbc0,processing,2016-10-05 22:44:13,2016-10-06 15:51:05,NaT,NaT,2016-12-09


Orders with `processing` status can be found throughout the observation period. The latest one was placed on 2018-07-23, almost three months before the last date in our dataframe. Thus, the `processing` status also does **not** describe the latest purchases.
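The "is this status just the most recent orders?" argument can be made for all statuses at once by comparing the latest purchase timestamp per status with the end of the dataset. A small sketch, assuming a toy frame `orders_demo` that holds only the first and last rows shown above for each status, and using the maximum `order_purchase_timestamp` found earlier as `dataset_end`:

```python
import pandas as pd

# Last purchase timestamp in the whole dataset (from the min/max table above).
dataset_end = pd.Timestamp("2018-10-17 17:30:18")

# Toy subset: the head/tail rows displayed above for each status.
orders_demo = pd.DataFrame({
    "order_status": ["shipped", "shipped", "invoiced", "invoiced",
                     "processing", "processing"],
    "order_purchase_timestamp": pd.to_datetime([
        "2018-09-03 09:06:57", "2016-10-07 15:53:31",
        "2018-08-14 18:45:08", "2016-10-04 13:02:10",
        "2018-07-23 18:03:03", "2016-10-05 22:44:13",
    ]),
})

# Latest order per status and its distance (in whole days) from the dataset end.
last_purchase = orders_demo.groupby("order_status")["order_purchase_timestamp"].max()
gap_days = (dataset_end - last_purchase).dt.days
print(gap_days.sort_values())
```

A gap of several weeks suggests that the status is not just a snapshot of orders still moving through the pipeline.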

**Conclusion**
- As orders with `shipped` status have already been handed over to the logistics company, we can consider them completed purchases. Apart from `delivered`, none of the other statuses can be considered a purchase.

#### As a result, only the orders with `delivered` and `shipped` statuses can be considered as **purchase**.
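One way to encode this definition is a boolean flag derived from `order_status`. The sketch below uses a hypothetical frame `orders_demo` with one row per status; the constant `PURCHASE_STATUSES` and the column name `is_purchase` are illustrative names, not part of the dataset.

```python
import pandas as pd

# Statuses counted as a purchase, per the conclusion above.
PURCHASE_STATUSES = ["delivered", "shipped"]

# Toy frame with one placeholder order per status.
orders_demo = pd.DataFrame({
    "order_id": [f"o{i}" for i in range(1, 9)],
    "order_status": ["delivered", "shipped", "canceled", "unavailable",
                     "invoiced", "processing", "created", "approved"],
})

# Flag each order, then keep only the purchases.
orders_demo["is_purchase"] = orders_demo["order_status"].isin(PURCHASE_STATUSES)
purchases = orders_demo.loc[orders_demo["is_purchase"]]
print(purchases)
```

Applied to the real `orders` dataframe, the same filter would keep 96,478 + 1,107 = 97,585 orders according to the status counts above.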

---

### Items dataset:

In [37]:
items = pd.read_csv('olist_order_items_dataset.csv', parse_dates=['shipping_limit_date'])
items.head()

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.9,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.0,17.87
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.9,18.14


In [38]:
items.shape

(112650, 7)

This table has 112,650 rows, considerably more than the previously explored dataframes (99,441 rows each).
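A likely explanation for the extra rows is that one order can contain several items, assuming `order_item_id` is a sequential item number within an order. The sketch below shows how this could be checked on a hypothetical toy frame `items_demo`:

```python
import pandas as pd

# Toy data with placeholder IDs: o1 has three items, o2 one, o3 two.
items_demo = pd.DataFrame({
    "order_id": ["o1", "o1", "o1", "o2", "o3", "o3"],
    "order_item_id": [1, 2, 3, 1, 1, 2],
})

# Number of item rows per order.
items_per_order = items_demo.groupby("order_id")["order_item_id"].count()

# How many orders have 1 item, 2 items, 3 items, ...
size_distribution = items_per_order.value_counts().sort_index()
print(items_per_order)
print(size_distribution)
```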

In [39]:
items.nunique()

order_id               98666
order_item_id             21
product_id             32951
seller_id               3095
shipping_limit_date    93318
price                   5968
freight_value           6999
dtype: int64

We have 98666 orders listed in the current dataframe, while in the `orders` table, we had 99441 entries.

Let's check the statuses of the orders that are not present in the current dataframe.

In [40]:
orders_not_in_items = orders.loc[~orders.order_id.isin(items.order_id)]
orders_not_in_items.order_status.value_counts()

order_status
unavailable    603
canceled       164
created          5
invoiced         2
shipped          1
Name: count, dtype: int64

As only the `shipped` status from the listed ones is considered a purchase, let's check this order.

In [41]:
orders_not_in_items.loc[orders_not_in_items.order_status == 'shipped']

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
23254,a68ce1686d536ca72bd2dadc4b8671e5,d7bed5fac093a4136216072abaf599d5,shipped,2016-10-05 01:47:40,2016-10-07 03:11:22,2016-11-07 16:37:37,NaT,2016-12-01


In [42]:
customers.loc[customers.customer_id == 'd7bed5fac093a4136216072abaf599d5']

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
54448,d7bed5fac093a4136216072abaf599d5,f15a952dfc52308d0361288fbf42c7b3,91250,porto alegre,RS


The order `a68ce1686d536ca72bd2dadc4b8671e5` from customer `d7bed5fac093a4136216072abaf599d5` is missing from the items dataframe, while all other `shipped` orders are present.
Most likely, the details of this order were lost. We can continue treating the `shipped` status as a purchase.

**Summary**
- We can treat both `delivered` and `shipped` statuses as **purchases**. All `delivered` orders are present in every dataframe explored so far. One `shipped` order is missing from the `items` dataframe. We can ignore this order while still considering the `shipped` status a purchase.
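The comparison between `orders` and `items` used in this section can also be written as an anti-join with `merge(..., indicator=True)`, which is an alternative to the `isin` filter rather than the method used above. A minimal sketch on hypothetical toy frames (`orders_demo`, `items_demo`):

```python
import pandas as pd

# Toy data with placeholder IDs.
orders_demo = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "order_status": ["delivered", "unavailable", "shipped", "canceled"],
})
items_demo = pd.DataFrame({
    "order_id": ["o1", "o1", "o3"],
    "order_item_id": [1, 2, 1],
})

# Deduplicate item order IDs so the left join cannot multiply order rows.
item_order_ids = items_demo[["order_id"]].drop_duplicates()

# indicator=True adds a `_merge` column: "both" or "left_only" for a left join.
flagged = orders_demo.merge(item_order_ids, on="order_id", how="left", indicator=True)
missing = flagged.loc[flagged["_merge"] == "left_only"].drop(columns="_merge")
missing_by_status = missing["order_status"].value_counts()
print(missing_by_status)
```

The `_merge` column makes it easy to keep both the matched and unmatched orders in one frame when both groups need to be inspected.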