
# Package `pandas` and dataframe structure

In data analysis, pandas is one of the most important libraries to know. As discussed in class, structured data is usually stored in tabular format, such as the order table from the case study. The `DataFrame` structure from the `pandas` library is very convenient for this purpose.

## 1. Dataframe information

To start using pandas, import it into our notebook:

In [None]:
import pandas as pd

We can create a dataframe by entering the data directly in Python code. More often, though, we read tables from external sources such as CSV files and store them in a dataframe. Let's read the orders table as before.

In [None]:
# Orders
orders = pd.read_csv('https://raw.githubusercontent.com/thuynh386/olist_ecommerce_dataset/master/olist_orders_dataset.csv')
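
For comparison, here is a minimal sketch of the other route mentioned above: building a dataframe directly in Python code. The name `toy_orders` and all values in it are made up for illustration and are not part of the Olist dataset.

```python
import pandas as pd

# Hypothetical mini table: each dictionary key becomes a column name,
# and each list holds that column's values (all values are invented).
toy_orders = pd.DataFrame({
    "order_id": ["a1", "b2", "c3"],
    "order_status": ["delivered", "shipped", "canceled"],
    "price": [58.9, 239.9, 12.99],
})

print(toy_orders)
```

All lists must have the same length, since each one supplies a value for every row.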

Now we have a dataframe `orders` containing information about every order made on Olist during the time period in consideration.

`head()` shows the first few rows of a dataframe (5 by default); `tail()` shows the last few rows (also 5 by default).

In [None]:
orders.head()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00
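
Since only `head()` is run above, here is a small sketch of `tail()` and of passing a row count to either function. It uses a hypothetical 10-row frame named `toy` rather than the `orders` table, so the results are easy to check by eye.

```python
import pandas as pd

# Hypothetical data: 10 rows with the default integer labels 0..9
toy = pd.DataFrame({"value": range(10)})

last_five = toy.tail()     # default: last 5 rows (labels 5..9)
last_two = toy.tail(2)     # last 2 rows (labels 8 and 9)
first_three = toy.head(3)  # first 3 rows (labels 0, 1, 2)

print(last_two)
```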


A dataframe is always indexed. If we don't specify an index, the rows are automatically labeled with integers, as seen above. In Python, indexing starts at 0 (zero-based indexing), so with this default index each row's label is the same as its position.

We can see some information about a dataframe by using `shape` and `columns` attributes.

In [None]:
# 99,441 rows, 8 columns
orders.shape

(99441, 8)

In [None]:
# List of table columns
orders.columns

Index(['order_id', 'customer_id', 'order_status', 'order_purchase_timestamp',
       'order_approved_at', 'order_delivered_carrier_date',
       'order_delivered_customer_date', 'order_estimated_delivery_date'],
      dtype='object')

`info()` provides further information about the dataframe: the column names, the number of non-null values in each column, and the data type of each column.

In [None]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       99441 non-null  object
 1   customer_id                    99441 non-null  object
 2   order_status                   99441 non-null  object
 3   order_purchase_timestamp       99441 non-null  object
 4   order_approved_at              99281 non-null  object
 5   order_delivered_carrier_date   97658 non-null  object
 6   order_delivered_customer_date  96476 non-null  object
 7   order_estimated_delivery_date  99441 non-null  object
dtypes: object(8)
memory usage: 6.1+ MB


## 2. Slicing & Subsetting
Very often, we only want to work with part of a dataframe, such as the orders in the category "bebes". Dataframes support two kinds of operations for this:
1. Slicing
1. Subsetting

### 2.1. Slicing
If we want to select rows in a given range, we can use slicing. The syntax is `data[start:stop]`, where `start` and `stop` are row positions.

In [None]:
# Select rows from 5 to 15 from orders
orders[5:16]

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
5,a4591c265e18cb1dcee52889e2d8acc3,503740e9ca751ccdda7ba28e9ab8f608,delivered,2017-07-09 21:57:05,2017-07-09 22:10:13,2017-07-11 14:58:04,2017-07-26 10:57:55,2017-08-01 00:00:00
6,136cce7faa42fdb2cefd53fdc79a6098,ed0271e0b7da060a393796590e7b737a,invoiced,2017-04-11 12:22:08,2017-04-13 13:25:17,,,2017-05-09 00:00:00
7,6514b8ad8028c9f2cc2374ded245783f,9bdf08b4b3b52b5526ff42d37d47f222,delivered,2017-05-16 13:10:30,2017-05-16 13:22:11,2017-05-22 10:07:46,2017-05-26 12:55:51,2017-06-07 00:00:00
8,76c6e866289321a7c93b82b54852dc33,f54a9f0e6b351c431402b8461ea51999,delivered,2017-01-23 18:29:09,2017-01-25 02:50:47,2017-01-26 14:16:31,2017-02-02 14:08:10,2017-03-06 00:00:00
9,e69bfb5eb88e0ed6a785585b27e16dbf,31ad1d1b63eb9962463f764d4e6e0c9d,delivered,2017-07-29 11:55:02,2017-07-29 12:05:32,2017-08-10 19:45:24,2017-08-16 17:14:30,2017-08-23 00:00:00
10,e6ce16cb79ec1d90b1da9085a6118aeb,494dded5b201313c64ed7f100595b95c,delivered,2017-05-16 19:41:10,2017-05-16 19:50:18,2017-05-18 11:40:40,2017-05-29 11:18:31,2017-06-07 00:00:00
11,34513ce0c4fab462a55830c0989c7edb,7711cf624183d843aafe81855097bc37,delivered,2017-07-13 19:58:11,2017-07-13 20:10:08,2017-07-14 18:43:29,2017-07-19 14:04:48,2017-08-08 00:00:00
12,82566a660a982b15fb86e904c8d32918,d3e3b74c766bc6214e0c830b17ee2341,delivered,2018-06-07 10:06:19,2018-06-09 03:13:12,2018-06-11 13:29:00,2018-06-19 12:05:52,2018-07-18 00:00:00
13,5ff96c15d0b717ac6ad1f3d77225a350,19402a48fe860416adf93348aba37740,delivered,2018-07-25 17:44:10,2018-07-25 17:55:14,2018-07-26 13:16:00,2018-07-30 15:52:25,2018-08-08 00:00:00
14,432aaf21d85167c2c86ec9448c4e42cc,3df704f53d3f1d4818840b34ec672a9f,delivered,2018-03-01 14:14:28,2018-03-01 15:10:47,2018-03-02 21:09:20,2018-03-12 23:36:26,2018-03-21 00:00:00


**Question: Why is the range written as `5:16` when we want rows 5 to 15?**


In [None]:
# Or rows from 0 to 20. If the rows we want start from 0, we can omit the 0 in the range.
orders[:21]

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00
5,a4591c265e18cb1dcee52889e2d8acc3,503740e9ca751ccdda7ba28e9ab8f608,delivered,2017-07-09 21:57:05,2017-07-09 22:10:13,2017-07-11 14:58:04,2017-07-26 10:57:55,2017-08-01 00:00:00
6,136cce7faa42fdb2cefd53fdc79a6098,ed0271e0b7da060a393796590e7b737a,invoiced,2017-04-11 12:22:08,2017-04-13 13:25:17,,,2017-05-09 00:00:00
7,6514b8ad8028c9f2cc2374ded245783f,9bdf08b4b3b52b5526ff42d37d47f222,delivered,2017-05-16 13:10:30,2017-05-16 13:22:11,2017-05-22 10:07:46,2017-05-26 12:55:51,2017-06-07 00:00:00
8,76c6e866289321a7c93b82b54852dc33,f54a9f0e6b351c431402b8461ea51999,delivered,2017-01-23 18:29:09,2017-01-25 02:50:47,2017-01-26 14:16:31,2017-02-02 14:08:10,2017-03-06 00:00:00
9,e69bfb5eb88e0ed6a785585b27e16dbf,31ad1d1b63eb9962463f764d4e6e0c9d,delivered,2017-07-29 11:55:02,2017-07-29 12:05:32,2017-08-10 19:45:24,2017-08-16 17:14:30,2017-08-23 00:00:00
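
The slicing rules can be checked on a frame small enough to read in full. This is a sketch on hypothetical data (the name `toy` and its letters are invented); it shows that the start position is included, the stop position is excluded, and either end can be omitted.

```python
import pandas as pd

# Hypothetical data: 8 rows at positions 0..7
toy = pd.DataFrame({"value": list("abcdefgh")})

middle = toy[2:5]  # positions 2, 3, 4 (stop position 5 is excluded)
start = toy[:3]    # same as toy[0:3]
end = toy[5:]      # from position 5 through the last row

print(len(middle), len(start), len(end))  # 3 3 3
```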


To select certain columns only, provide the column names as below.

In [None]:
# Choose column order_status only
orders[["order_status"]] # A list of column names returns a dataframe (column name shown as header)

Unnamed: 0,order_status
0,delivered
1,delivered
2,delivered
3,delivered
4,delivered
...,...
99436,delivered
99437,delivered
99438,delivered
99439,delivered


In [None]:
orders["order_status"] # A single column name returns a Series (name shown at the bottom)

0        delivered
1        delivered
2        delivered
3        delivered
4        delivered
           ...    
99436    delivered
99437    delivered
99438    delivered
99439    delivered
99440    delivered
Name: order_status, Length: 99441, dtype: object
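
The two outputs above look different because they are different types of object. A quick sketch on a hypothetical two-row frame (`toy` and its values are invented) makes this visible:

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "order_status": ["delivered", "shipped"],
    "price": [10.0, 20.0],
})

as_frame = toy[["order_status"]]  # list of names -> DataFrame with one column
as_series = toy["order_status"]   # single name   -> Series

print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
print(type(as_series).__name__, as_series.shape)  # Series (2,)
```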

In [None]:
# Select several columns at once
orders[["order_status", "order_purchase_timestamp"]]

Unnamed: 0,order_status,order_purchase_timestamp
0,delivered,2017-10-02 10:56:33
1,delivered,2018-07-24 20:41:37
2,delivered,2018-08-08 08:38:49
3,delivered,2017-11-18 19:28:06
4,delivered,2018-02-13 21:18:39
...,...,...
99436,delivered,2017-03-09 09:54:05
99437,delivered,2018-02-06 12:58:58
99438,delivered,2017-08-27 14:46:43
99439,delivered,2018-01-08 21:28:27


### 2.2. Subsetting by location and labels

If we want to select given rows and columns at the same time, we have to use subsetting with either:
1. `iloc`: specify rows and columns by integer locations.
1. `loc`: specify rows and columns by *label*.

`iloc`: Select rows from 0 to 20 (21 rows) and columns from 2 to 4 (3 columns).

In [None]:
orders.iloc[0:21, 2:5]

Unnamed: 0,order_status,order_purchase_timestamp,order_approved_at
0,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15
1,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27
2,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23
3,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59
4,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29
5,delivered,2017-07-09 21:57:05,2017-07-09 22:10:13
6,invoiced,2017-04-11 12:22:08,2017-04-13 13:25:17
7,delivered,2017-05-16 13:10:30,2017-05-16 13:22:11
8,delivered,2017-01-23 18:29:09,2017-01-25 02:50:47
9,delivered,2017-07-29 11:55:02,2017-07-29 12:05:32


In [None]:
# Select 1 element from the dataframe: row at position 3 (the fourth row), first column
orders.iloc[3, 0]

'949d5b44dbf5de918fe9c16f97b45f8a'

`loc`: Rows here are labeled by integers and columns are labeled by column names.

In [None]:
orders.loc[0:10, ["order_id","order_status","order_purchase_timestamp"]]

Unnamed: 0,order_id,order_status,order_purchase_timestamp
0,e481f51cbdc54678b7cc49136f2d6af7,delivered,2017-10-02 10:56:33
1,53cdb2fc8bc7dce0b6741e2150273451,delivered,2018-07-24 20:41:37
2,47770eb9100c2d0c44946d9cf07ec65d,delivered,2018-08-08 08:38:49
3,949d5b44dbf5de918fe9c16f97b45f8a,delivered,2017-11-18 19:28:06
4,ad21c59c0840e6cb83a9ceb5573f8159,delivered,2018-02-13 21:18:39
5,a4591c265e18cb1dcee52889e2d8acc3,delivered,2017-07-09 21:57:05
6,136cce7faa42fdb2cefd53fdc79a6098,invoiced,2017-04-11 12:22:08
7,6514b8ad8028c9f2cc2374ded245783f,delivered,2017-05-16 13:10:30
8,76c6e866289321a7c93b82b54852dc33,delivered,2017-01-23 18:29:09
9,e69bfb5eb88e0ed6a785585b27e16dbf,delivered,2017-07-29 11:55:02
10,e6ce16cb79ec1d90b1da9085a6118aeb,delivered,2017-05-16 19:41:10


If the label does not exist, we get an error.

In [None]:
orders.loc[0:10, [0,4,5]]

KeyError: ignored
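
To see the error without stopping the notebook, the lookup can be wrapped in `try`/`except`. The sketch below uses a hypothetical two-column frame named `toy`; the integers `0` and `1` are not column labels there, so `loc` raises `KeyError`, while `iloc` accepts the same integers as positions.

```python
import pandas as pd

# Hypothetical data: the column labels are strings, not integers
toy = pd.DataFrame({
    "order_id": ["a1", "b2"],
    "order_status": ["delivered", "shipped"],
})

raised = False
try:
    toy.loc[0:1, [0, 1]]  # 0 and 1 are not column labels here
except KeyError as err:
    raised = True
    print("KeyError:", err)

# The same integers are valid as positions with iloc
by_position = toy.iloc[0:2, [0, 1]]
```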

Also, compare the results of the two operations below. What is the difference that needs to be remembered?

In [None]:
# Using iloc to subset
orders.iloc[:10, 0:3]

Unnamed: 0,order_id,customer_id,order_status
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered
5,a4591c265e18cb1dcee52889e2d8acc3,503740e9ca751ccdda7ba28e9ab8f608,delivered
6,136cce7faa42fdb2cefd53fdc79a6098,ed0271e0b7da060a393796590e7b737a,invoiced
7,6514b8ad8028c9f2cc2374ded245783f,9bdf08b4b3b52b5526ff42d37d47f222,delivered
8,76c6e866289321a7c93b82b54852dc33,f54a9f0e6b351c431402b8461ea51999,delivered
9,e69bfb5eb88e0ed6a785585b27e16dbf,31ad1d1b63eb9962463f764d4e6e0c9d,delivered


In [None]:
# Using loc to subset
orders.loc[:10, ["order_id", "customer_id","order_status"]]

Unnamed: 0,order_id,customer_id,order_status
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered
5,a4591c265e18cb1dcee52889e2d8acc3,503740e9ca751ccdda7ba28e9ab8f608,delivered
6,136cce7faa42fdb2cefd53fdc79a6098,ed0271e0b7da060a393796590e7b737a,invoiced
7,6514b8ad8028c9f2cc2374ded245783f,9bdf08b4b3b52b5526ff42d37d47f222,delivered
8,76c6e866289321a7c93b82b54852dc33,f54a9f0e6b351c431402b8461ea51999,delivered
9,e69bfb5eb88e0ed6a785585b27e16dbf,31ad1d1b63eb9962463f764d4e6e0c9d,delivered
10,e6ce16cb79ec1d90b1da9085a6118aeb,494dded5b201313c64ed7f100595b95c,delivered
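
One way to explore the comparison is to repeat it on a frame whose row labels are not integers, so positions and labels cannot be confused. This is a sketch on hypothetical data: the frame `toy` and its labels `w` to `z` are invented.

```python
import pandas as pd

# Hypothetical data with string row labels instead of the default integers
toy = pd.DataFrame(
    {"order_status": ["delivered", "shipped", "canceled", "invoiced"]},
    index=["w", "x", "y", "z"],
)

by_position = toy.iloc[0:2]  # positions 0 and 1; stop position 2 is excluded
by_label = toy.loc["w":"y"]  # labels w, x and y; the end label is included

print(len(by_position), len(by_label))  # 2 3
```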


### 2.3. Subsetting by conditions
We can also select rows of a dataframe by specifying conditions. The result contains only rows which satisfy our conditions.

Conditions available:

1. Equals: `==`
1. Not equals: `!=`
1. Greater than, less than: `>` or `<`
1. Greater than or equal to `>=`
1. Less than or equal to `<=`


Syntax for one condition: `data[condition]`, where `condition` is a comparison on a column.



In [None]:
# Select only canceled orders
orders[orders["order_status"] == "canceled"]

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
397,1b9ecfe83cdc259250e1a8aca174f0ad,6d6b50b66d79f80827b6d96751528d30,canceled,2018-08-04 14:29:27,2018-08-07 04:10:26,,,2018-08-14 00:00:00
613,714fb133a6730ab81fa1d3c1b2007291,e3fe72696c4713d64d3c10afe71e75ed,canceled,2018-01-26 21:34:08,2018-01-26 21:58:39,2018-01-29 22:33:25,,2018-02-22 00:00:00
1058,3a129877493c8189c59c60eb71d97c29,0913cdce793684e52bbfac69d87e91fd,canceled,2018-01-25 13:34:24,2018-01-25 13:50:20,2018-01-26 21:42:18,,2018-02-23 00:00:00
1130,00b1cb0320190ca0daa2c88b35206009,3532ba38a3fd242259a514ac2b6ae6b6,canceled,2018-08-28 15:26:39,,,,2018-09-12 00:00:00
1801,ed3efbd3a87bea76c2812c66a0b32219,191984a8ba4cbb2145acb4fe35b69664,canceled,2018-09-20 13:54:16,,,,2018-10-17 00:00:00
...,...,...,...,...,...,...,...,...
98791,b159d0ce7cd881052da94fa165617b05,e0c3bc5ce0836b975d6b2a8ce7bb0e3e,canceled,2017-03-11 19:51:36,2017-03-11 19:51:36,,,2017-03-30 00:00:00
98909,e49e7ce1471b4693482d40c2bd3ad196,e4e7ab3f449aeb401f0216f86c2104db,canceled,2018-08-07 11:16:28,,,,2018-08-10 00:00:00
99143,6560fb10610771449cb0463c5ba12199,0d07d0a588caf93cc66b7a8aff86d2fe,canceled,2017-10-01 22:26:25,2017-10-01 22:35:22,,,2017-10-27 00:00:00
99283,3a3cddda5a7c27851bd96c3313412840,0b0d6095c5555fe083844281f6b093bb,canceled,2018-08-31 16:13:44,,,,2018-10-01 00:00:00


Syntax for multiple conditions: combine them with the logical operators `&` (AND) and `|` (OR), and wrap each condition in parentheses.

In [None]:
# Select only canceled orders by customer `e4e7ab3f449aeb401f0216f86c2104db`
orders[(orders["order_status"] == "canceled") & (orders["customer_id"]=="e4e7ab3f449aeb401f0216f86c2104db")]

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
98909,e49e7ce1471b4693482d40c2bd3ad196,e4e7ab3f449aeb401f0216f86c2104db,canceled,2018-08-07 11:16:28,,,,2018-08-10 00:00:00


In [None]:
# Select orders that are either canceled or shipped
orders[(orders["order_status"] == "canceled") | (orders["order_status"] == "shipped")]

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
44,ee64d42b8cf066f35eac1cf57de1aa85,caded193e8e47b8362864762a83db3c5,shipped,2018-06-04 16:44:48,2018-06-05 04:31:18,2018-06-05 14:32:00,,2018-06-28 00:00:00
154,6942b8da583c2f9957e990d028607019,52006a9383bf149a4fb24226b173106f,shipped,2018-01-10 11:33:07,2018-01-11 02:32:30,2018-01-11 19:39:23,,2018-02-07 00:00:00
162,36530871a5e80138db53bcfd8a104d90,4dafe3c841d2d6cc8a8b6d25b35704b9,shipped,2017-05-09 11:48:37,2017-05-11 11:45:14,2017-05-11 13:21:47,,2017-06-08 00:00:00
231,4d630f57194f5aba1a3d12ce23e71cd9,6d491c9fe2f04f6e2af6ec033cd8907c,shipped,2017-11-17 19:53:21,2017-11-18 19:50:31,2017-11-22 17:28:34,,2017-12-13 00:00:00
299,3b4ad687e7e5190db827e1ae5a8989dd,1a87b8517b7d31373b50396eb15cb445,shipped,2018-06-28 12:52:15,2018-06-28 13:11:09,2018-07-04 15:20:00,,2018-08-03 00:00:00
...,...,...,...,...,...,...,...,...
99113,274a7f7e4f1c17b7434a830e9b8759b1,670af30ca5b8c20878fecdafa5ee01b9,shipped,2018-06-23 13:25:15,2018-06-23 13:40:11,2018-07-04 13:51:00,,2018-07-24 00:00:00
99143,6560fb10610771449cb0463c5ba12199,0d07d0a588caf93cc66b7a8aff86d2fe,canceled,2017-10-01 22:26:25,2017-10-01 22:35:22,,,2017-10-27 00:00:00
99181,636cdd02667dc8d76d9296bf20a6890a,c162256b133c76f79181ce61d66545db,shipped,2018-02-17 14:31:22,2018-02-20 07:11:31,2018-02-20 19:18:58,,2018-03-14 00:00:00
99283,3a3cddda5a7c27851bd96c3313412840,0b0d6095c5555fe083844281f6b093bb,canceled,2018-08-31 16:13:44,,,,2018-10-01 00:00:00
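
When several `==` conditions on the same column are joined with `|`, the pandas method `Series.isin` is a possible shorthand. The sketch below checks on a hypothetical four-row frame (`toy`, invented values) that both forms select the same rows.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "order_id": ["a1", "b2", "c3", "d4"],
    "order_status": ["delivered", "shipped", "canceled", "shipped"],
})

# Two equality conditions combined with OR
with_or = toy[(toy["order_status"] == "canceled") | (toy["order_status"] == "shipped")]

# Same selection: keep rows whose status appears in the given list
with_isin = toy[toy["order_status"].isin(["canceled", "shipped"])]

print(with_or.equals(with_isin))  # True
```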


After applying conditions to select the rows we want, we can still select only certain columns.

In [None]:
orders[(orders["order_status"] == "canceled") | (orders["order_status"] == "shipped")][["order_id","order_status","order_purchase_timestamp"]]

Unnamed: 0,order_id,order_status,order_purchase_timestamp
44,ee64d42b8cf066f35eac1cf57de1aa85,shipped,2018-06-04 16:44:48
154,6942b8da583c2f9957e990d028607019,shipped,2018-01-10 11:33:07
162,36530871a5e80138db53bcfd8a104d90,shipped,2017-05-09 11:48:37
231,4d630f57194f5aba1a3d12ce23e71cd9,shipped,2017-11-17 19:53:21
299,3b4ad687e7e5190db827e1ae5a8989dd,shipped,2018-06-28 12:52:15
...,...,...,...
99113,274a7f7e4f1c17b7434a830e9b8759b1,shipped,2018-06-23 13:25:15
99143,6560fb10610771449cb0463c5ba12199,canceled,2017-10-01 22:26:25
99181,636cdd02667dc8d76d9296bf20a6890a,shipped,2018-02-17 14:31:22
99283,3a3cddda5a7c27851bd96c3313412840,canceled,2018-08-31 16:13:44
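
An alternative to chaining two pairs of brackets is to pass the condition and the column list to `loc` in one step. This is a sketch on hypothetical data (`toy` and the variable name `mask` are invented); the condition is stored in a variable first to keep the line short.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "order_id": ["a1", "b2", "c3", "d4"],
    "order_status": ["delivered", "shipped", "canceled", "shipped"],
    "price": [58.9, 239.9, 12.99, 21.9],
})

# Boolean Series: True for the rows we want to keep
mask = (toy["order_status"] == "canceled") | (toy["order_status"] == "shipped")

# Rows chosen by the mask, columns chosen by label, in a single loc call
subset = toy.loc[mask, ["order_id", "order_status"]]

print(subset)
```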


## 3. Summary statistics

With a pandas dataframe, there is a very convenient function which returns summary statistics for all numeric columns: `describe()`.

Let's explore the `order_items` table as it has more numerical columns.

In [None]:
# Order items
order_items = pd.read_csv('https://raw.githubusercontent.com/thuynh386/olist_ecommerce_dataset/master/olist_order_items_dataset.csv')

In [None]:
order_items.head(10)

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.9,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.0,17.87
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.9,18.14
5,00048cc3ae777c65dbb7d2a0634bc1ea,1,ef92defde845ab8450f9d70c526ef70f,6426d21aca402a131fc0a5d0960a3c90,2017-05-23 03:55:27,21.9,12.69
6,00054e8431b9d7675808bcb819fb4a32,1,8d4f2bb7e93e6710a28f34fa83ee7d28,7040e82f899a04d1b434b795a43b4617,2017-12-14 12:10:31,19.9,11.85
7,000576fe39319847cbb9d288c5617fa6,1,557d850972a7d6f792fd18ae1400d9b6,5996cddab893a4652a15592fb58ab8db,2018-07-10 12:30:45,810.0,70.75
8,0005a1a1728c9d785b8e2b08b904576c,1,310ae3c140ff94b03219ad0adc3c778f,a416b6a846a11724393025641d4edd5e,2018-03-26 18:31:29,145.95,11.65
9,0005f50442cb953dcd1d21e1fb923495,1,4535b0e1091c278dfd193e5a1d63b39f,ba143b05f0110f0dc71ad71b4466ce92,2018-07-06 14:10:56,53.99,11.4


In [None]:
order_items.describe()

Unnamed: 0,order_item_id,price,freight_value
count,112650.0,112650.0,112650.0
mean,1.197834,120.653739,19.99032
std,0.705124,183.633928,15.806405
min,1.0,0.85,0.0
25%,1.0,39.9,13.08
50%,1.0,74.99,16.26
75%,1.0,134.9,21.15
max,21.0,6735.0,409.68
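
Notice that the text columns such as `order_id` are missing from the output above: by default `describe()` summarizes numeric columns only. A small sketch on hypothetical data (`toy_items`, invented values) shows the `include="all"` option, which adds the remaining columns with statistics such as `unique`, `top` and `freq`.

```python
import pandas as pd

# Hypothetical data for illustration
toy_items = pd.DataFrame({
    "order_id": ["a1", "a1", "b2"],
    "price": [58.9, 239.9, 12.99],
    "freight_value": [13.29, 19.93, 12.79],
})

numeric_summary = toy_items.describe()           # numeric columns only
all_summary = toy_items.describe(include="all")  # every column

print(all_summary)
```

Statistics that do not apply to a column (for example `mean` for `order_id`) appear as `NaN` in the combined table.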


If we want, we can also calculate a particular summary statistic ourselves. For instance, suppose we are interested in the mean of `freight_value`. Let's select that column from the dataframe and apply the `mean` function to it.

In [None]:
order_items["freight_value"].mean()

19.99031992898562

In [None]:
# Or median
order_items["freight_value"].median()

16.26

In [None]:
# Or count the number of non-null rows for price
order_items["price"].count()

112650

In [None]:
# Or count the unique values of price. Many items share the same price, so this number is much smaller than the row count.
order_items["price"].nunique()

5968
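
Several of these statistics can also be requested in one call with `agg`, which accepts a list of function names. The sketch below uses a hypothetical three-row frame (`toy_items`, invented values), not the real `order_items` table.

```python
import pandas as pd

# Hypothetical data for illustration
toy_items = pd.DataFrame({
    "price": [58.9, 239.9, 12.99],
    "freight_value": [13.29, 19.93, 12.79],
})

# One Series back, indexed by the names of the statistics requested
stats = toy_items["freight_value"].agg(["mean", "median", "count"])

print(stats)
```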

## 4. Practice with the `order_payments` table

Note: one order can be paid for by several payment methods. For more information about each column, refer to the Kaggle website.

In [None]:
# Order payment
order_payments = pd.read_csv('https://raw.githubusercontent.com/thuynh386/olist_ecommerce_dataset/master/olist_order_payments_dataset.csv')

In [None]:
order_payments.head()

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
0,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,99.33
1,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,24.39
2,25e8ea4e93396b6fa0d3dd708e76c1bd,1,credit_card,1,65.71
3,ba78997921bbcdc1373bb41e913ab953,1,credit_card,8,107.78
4,42fdf880ba16b47b59251dd489d4441a,1,credit_card,2,128.45


In [None]:
# Which payment methods are there?
order_payments["payment_type"].unique()

array(['credit_card', 'boleto', 'voucher', 'debit_card', 'not_defined'],
      dtype=object)

Question 1: Select only orders paid by credit card.

In [None]:
# Write your code here. 



Question 2: What is the average of all values in `payment_value` column?

In [None]:
# Write your code here. 



Question 3: Select rows from 1000 to 2000 and exclude column `payment_sequential`.

In [None]:
# Write your code here. 



Question 4: How many choices of payment installments do customers have?

In [None]:
# Write your code here. 



Question 5: Select only orders which have been paid by either credit card or boleto and have a payment value greater than or equal to 100.

In [None]:
# Write your code here. 



Question 6: Dataframes have a `quantile` function for calculating percentile values ([Link here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html)). See if you can calculate the 30th, 60th, 90th and 95th percentiles of the `payment_value` column.

In [None]:
# Write your code here. 

