<a href="https://colab.research.google.com/github/kwanhong66/TodayILearned/blob/master/data_science/pandas/pandas_exercise/01_Getting_%26_Knowing_Your_Data/Chipotle/Exercise_with_Solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ex2 - Getting and Knowing your Data

Check out the [Chipotle Exercises Video](https://www.youtube.com/watch?v=lpuYZ5EUyS8&list=PLgJhDSE2ZLxaY_DigHeiIDC1cD09rXgJv&index=2) tutorial to watch a data scientist go through the exercises.

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [0]:
import pandas as pd
import numpy as np

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

### Step 3. Assign it to a variable called chipo.

In [0]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
# filepath_or_buffer : str, path object or file-like object
# Any valid string path is acceptable. The string could be a URL. 
# Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected.

# A local file path or a file URL can be read in the same way
# TSV columns are separated by '\t' (the default separator ',' handles CSV)

# a single order consists of several items
# each order is flattened into one row per item in the chipotle dataset
# rows belonging to the same order can be differentiated by 'order_id'
chipo = pd.read_csv(url, sep='\t')
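The effect of `sep='\t'` can be checked without network access. The sketch below is a minimal illustration: `sample_tsv` is a hypothetical three-row sample written in the same tab-separated layout, not an excerpt of the real file.

```python
import io

import pandas as pd

# Hypothetical sample in the same tab-separated layout as chipotle.tsv
sample_tsv = (
    "order_id\tquantity\titem_name\tchoice_description\titem_price\n"
    "1\t1\tIzze\t[Clementine]\t$3.39\n"
    "1\t1\tSide of Chips\t\t$1.69\n"
    "2\t2\tChicken Bowl\t[Rice, Cheese]\t$16.98\n"
)

# read_csv accepts any file-like object; sep="\t" splits on tabs only,
# so the comma inside "[Rice, Cheese]" stays within a single field
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample.shape)
# The empty choice_description field is parsed as a missing value
print(sample["choice_description"].isna().sum())
```

Empty fields become missing values, which is why `choice_description` reports fewer non-null entries than the other columns in the real dataset.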

### Step 4. See the first 10 entries

In [61]:
# Return the first n rows
chipo.head(10)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
6,3,1,Side of Chips,,$1.69
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",$9.25
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",$9.25


### Step 5. What is the number of observations in the dataset?

In [62]:
# Solution 1
# pandas.Index
# Immutable ndarray implementing an ordered, sliceable set.
# RangeIndex, CategoricalIndex, MultiIndex, DatetimeIndex ...
# len(chipo.index)

# pandas.DataFrame.info
# Print a concise summary of a DataFrame.
# This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.
chipo.info() # shows that dataframe has 4,622 entries


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
order_id              4622 non-null int64
quantity              4622 non-null int64
item_name             4622 non-null object
choice_description    3376 non-null object
item_price            4622 non-null object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB


In [63]:
# Solution 2
# pandas.DataFrame.shape
# Return a tuple representing the dimensionality of the DataFrame.
# (4622, 5); row, col
chipo.shape[0]


4622
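The row count can be read in several equivalent ways. The sketch below uses a hypothetical three-row frame named `toy` to show that `len`, the index length, and `shape[0]` agree.

```python
import pandas as pd

# Hypothetical toy frame, used only to compare the three approaches
toy = pd.DataFrame({"order_id": [1, 1, 2], "quantity": [1, 2, 1]})

n_rows_len = len(toy)          # length of the DataFrame itself
n_rows_index = len(toy.index)  # length of its index
n_rows_shape = toy.shape[0]    # first element of the (rows, columns) tuple

print(n_rows_len, n_rows_index, n_rows_shape)
```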

### Step 6. What is the number of columns in the dataset?

In [64]:
chipo.shape[1]

5

### Step 7. Print the name of all the columns.

In [65]:
# The column labels of the DataFrame.
chipo.columns

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')

### Step 8. How is the dataset indexed?

In [66]:
# chipo dataset is indexed with RangeIndex (default index type)

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.RangeIndex.html
# Immutable Index implementing a monotonic integer range.
# RangeIndex is a memory-saving special case of Int64Index limited to representing monotonic ranges. 
# Using RangeIndex may in some instances improve computing speed.
chipo.index

RangeIndex(start=0, stop=4622, step=1)

### Step 9. Which was the most-ordered item? 

In [67]:
# the most-ordered item is the item with the highest total quantity
# each row of the chipo dataframe is one line of an order:
# it records how many units of an item were ordered and at what price

# we want a groupby aggregation, in this case by 'item_name'
# https://rfriend.tistory.com/391
# 1. First, group by 'item_name'
# 2. To get the total ordered quantity per item, aggregate each group with sum
# 3. To rank the items, sort the aggregated result by 'quantity' in descending order
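# numeric_only=True keeps only the numeric columns (order_id, quantity) in the result
groupby_item_sum = chipo.groupby(by='item_name').sum(numeric_only=True)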
sorted_by_quantity = groupby_item_sum.sort_values('quantity', ascending=False)
sorted_by_quantity.head(1)

Unnamed: 0_level_0,order_id,quantity
item_name,Unnamed: 1_level_1,Unnamed: 2_level_1
Chicken Bowl,713926,761


### Step 10. For the most-ordered item, how many items were ordered?

In [68]:
groupby_item = chipo.groupby('item_name')
groupby_item_sum = groupby_item.sum(numeric_only=True)
c = groupby_item_sum.sort_values('quantity', ascending=False)
c.head(1)

Unnamed: 0_level_0,order_id,quantity
item_name,Unnamed: 1_level_1,Unnamed: 2_level_1
Chicken Bowl,713926,761
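Steps 9 and 10 can also be answered from a single Series instead of a sorted DataFrame. This is a minimal sketch on a hypothetical four-row frame named `toy`, not the notebook's own solution: selecting `quantity` before aggregating avoids summing `order_id`, and `idxmax` returns the label of the largest total.

```python
import pandas as pd

# Hypothetical toy data with the same column names as chipo
toy = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Izze", "Chicken Bowl", "Side of Chips"],
    "quantity": [2, 1, 1, 1],
})

# Total quantity per item, as a Series indexed by item_name
quantity_by_item = toy.groupby("item_name")["quantity"].sum()

top_item = quantity_by_item.idxmax()   # label with the largest total
top_quantity = quantity_by_item.max()  # the total itself

print(top_item, top_quantity)
```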


### Step 11. What was the most ordered item in the choice_description column?

In [69]:
c = chipo.groupby('choice_description').sum(numeric_only=True)
c = c.sort_values(['quantity'], ascending=False)
c.head(1)

Unnamed: 0_level_0,order_id,quantity
choice_description,Unnamed: 1_level_1,Unnamed: 2_level_1
[Diet Coke],123455,159


### Step 12. How many items were ordered in total?

In [70]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sum.html#pandas.Series.sum
# aggregate sum of quantity
total_number_of_items = chipo['quantity'].sum()
total_number_of_items

4972

### Step 13. Turn the item price into a float

#### Step 13.a. Check the item price type

In [71]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dtype.html
chipo['item_price'].dtype

dtype('O')

#### Step 13.b. Create a lambda function and change the type of item price

In [0]:
make_dollar_to_float = lambda x: float(x[1:])
chipo['item_price'] = chipo['item_price'].apply(make_dollar_to_float)

#### Step 13.c. Check the item price type

In [73]:
# the 'item_price' column had object dtype: strings with a leading dollar sign (e.g. $11.75)
# the lambda in Step 13.b drops the '$' and converts the rest of the string to float
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html
chipo['item_price'].dtype

dtype('float64')
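The same conversion can be written with vectorized string methods instead of `apply`. The sketch below is an alternative under assumptions, run on a hypothetical Series named `prices` rather than on `chipo`: `str.lstrip("$")` removes the leading dollar sign and `astype(float)` converts the remaining text.

```python
import pandas as pd

# Hypothetical price strings in the same "$x.yy" format as item_price
prices = pd.Series(["$2.39", "$16.98", "$11.75"])

# Strip the leading "$" from every element, then convert to float
as_float = prices.str.lstrip("$").astype(float)

print(as_float.dtype)
print(as_float.tolist())
```

Unlike the lambda, which always removes the first character, `lstrip("$")` only removes dollar signs, so it leaves strings that have no `$` prefix unchanged before conversion.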

### Step 14. How much was the revenue for the period in the dataset?

In [74]:
# multiply each row's quantity by its item_price, then sum over all rows
total_revenue = (chipo['quantity'] * chipo['item_price']).sum()
print("Revenue is ${}".format(total_revenue))

Revenue is $39237.02


### Step 15. How many orders were made in the period?

In [85]:
# we need the number of unique 'order_id' values,
# because a single order is flattened into several rows, one per item

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html
# Hash table-based unique. Uniques are returned in order of appearance. This does NOT sort.
# Significantly faster than numpy.unique. Includes NA values.
count_by_order_id = chipo['order_id'].unique()
len(count_by_order_id)

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html
# value_counts method is used when unique values are counted
# Return a Series containing counts of unique values.

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.count.html#pandas.Series.count
# Return number of non-NA/null observations in the Series.
orders = chipo['order_id'].value_counts().count()
orders

1834
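Both approaches above count distinct values, and `Series.nunique` does the same in one call. The sketch below compares the three on a hypothetical Series named `order_ids`.

```python
import pandas as pd

# Hypothetical order ids: three distinct orders spread over six rows
order_ids = pd.Series([1, 1, 2, 3, 3, 3])

n_unique_len = len(order_ids.unique())         # length of the array of uniques
n_unique_vc = order_ids.value_counts().count() # number of entries in the counts Series
n_unique = order_ids.nunique()                 # direct count of distinct values

print(n_unique_len, n_unique_vc, n_unique)
```

Note that `nunique` and `value_counts` exclude missing values by default, while `unique` includes them, so the three results can differ on a column that contains NaN.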

### Step 16. What is the average revenue amount per order?

In [92]:
# Solution 1
# 1. calculate revenue amount per order
# 2. calculate mean of all order revenue
# create new column for entry revenue
chipo['revenue'] = chipo['quantity'] * chipo['item_price']
# sum only the 'revenue' column per order, then average over orders
revenue_per_order = chipo.groupby('order_id')['revenue'].sum()
revenue_per_order.mean()

21.394231188658654

In [94]:
# Solution 2
chipo.groupby('order_id')['revenue'].sum().mean()

21.394231188658654
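The two stages of this calculation (sum within each order, then average across orders) are easier to verify on small numbers. The sketch below uses a hypothetical three-row frame named `toy` with the same column names as `chipo`.

```python
import pandas as pd

# Hypothetical toy data: order 1 has two rows, order 2 has one
toy = pd.DataFrame({
    "order_id": [1, 1, 2],
    "quantity": [1, 2, 1],
    "item_price": [2.0, 3.0, 10.0],
})

# Row-level revenue, following the notebook's quantity * item_price definition
toy["revenue"] = toy["quantity"] * toy["item_price"]

# Stage 1: total revenue per order (order 1: 2 + 6 = 8, order 2: 10)
revenue_per_order = toy.groupby("order_id")["revenue"].sum()

# Stage 2: mean of the per-order totals
average_revenue = revenue_per_order.mean()

print(average_revenue)
```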

### Step 17. How many different items are sold?

In [87]:
chipo['item_name'].value_counts().count()

50