<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing Chipotle Data

_Author: Joseph Nelson (DC)_

---

For Project 2, you will complete a series of exercises exploring [order data from Chipotle](https://github.com/TheUpshot/chipotle), courtesy of _The New York Times'_ "The Upshot."

For these exercises, you will conduct basic exploratory data analysis (Pandas not required) to understand the essentials of Chipotle's order data: how many orders are being made, the average price per order, how many different ingredients are used, etc. These allow you to practice business analysis skills while also becoming comfortable with Python.

---

## Basic Level

### Part 1: Read in the file with `csv.reader()` and store it in an object called `file_nested_list`.

Hint: This is a TSV (tab-separated value) file, and `csv.reader()` needs to be told [how to handle it](https://docs.python.org/2/library/csv.html).

In [1]:
import csv
from collections import namedtuple   # Convenient to store the data rows

DATA_FILE = './data/chipotle.tsv'

In [2]:
file_nested_list = []
with open(DATA_FILE,'r') as tsvfile:
    file_input = csv.reader(tsvfile, delimiter='\t')
    for row in file_input:
        file_nested_list.append(row)
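
As a small check of how the `delimiter='\t'` argument behaves, the sketch below parses a two-row sample held in memory instead of the real file. `SAMPLE_TSV` and `sample_rows` are hypothetical names used only for this illustration; the row values are copied from the dataset.

```python
import csv
import io

# Hypothetical two-row sample in the same tab-separated layout as chipotle.tsv.
SAMPLE_TSV = (
    "order_id\tquantity\titem_name\tchoice_description\titem_price\n"
    "1\t1\tIzze\t[Clementine]\t$3.39 \n"
)

# io.StringIO stands in for the open file handle used in the cell above.
with io.StringIO(SAMPLE_TSV, newline='') as tsvfile:
    sample_rows = list(csv.reader(tsvfile, delimiter='\t'))

print(sample_rows[0])  # header row
print(sample_rows[1])  # first data row; note the trailing space in the price
```

Wrapping the reader in `list(...)` gives the same nested list as the explicit `for` loop with `append`.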


In [3]:
file_nested_list[0]

['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']

In [4]:
file_nested_list[1]

['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 ']

In [5]:
len(file_nested_list)

4623

In [6]:
file_nested_list

[['order_id', 'quantity', 'item_name', 'choice_description', 'item_price'],
 ['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 '],
 ['1', '1', 'Izze', '[Clementine]', '$3.39 '],
 ['1', '1', 'Nantucket Nectar', '[Apple]', '$3.39 '],
 ['1', '1', 'Chips and Tomatillo-Green Chili Salsa', 'NULL', '$2.39 '],
 ['2',
  '2',
  'Chicken Bowl',
  '[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]]',
  '$16.98 '],
 ['3',
  '1',
  'Chicken Bowl',
  '[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]',
  '$10.98 '],
 ['3', '1', 'Side of Chips', 'NULL', '$1.69 '],
 ['4',
  '1',
  'Steak Burrito',
  '[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]',
  '$11.75 '],
 ['4',
  '1',
  'Steak Soft Tacos',
  '[Tomatillo Green Chili Salsa, [Pinto Beans, Cheese, Sour Cream, Lettuce]]',
  '$9.25 '],
 ['5',
  '1',
  'Steak Burrito',
  '[Fresh Tomato Salsa, [Rice, Black Beans, Pinto Beans, Ch

### Part 2: Separate `file_nested_list` into the `header` and the `data`.


In [6]:
header = file_nested_list[0]

In [7]:
len(header)

5

In [8]:
header

['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']

In [9]:
header[0]

'order_id'

In [10]:
data = file_nested_list[1:]

In [11]:
len(data)

4622

In [12]:
data[0]

['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 ']

---

## Intermediate Level

### Part 3: Calculate the average price of an order.

Hint: Examine the data to see if the `quantity` column is relevant to this calculation.

Hint: Think carefully about the simplest way to do this!

Step 1 - Convert the data to a list of dictionaries.

In [13]:
data_dicts = []
for r in data:
    data_dicts.append(
        {
            header[0]:r[0],
            header[1]:r[1],
            header[2]:r[2],
            header[3]:r[3],
            header[4]:r[4],
        }
    )
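
The same list of dictionaries can be built without spelling out each index. This is a minimal sketch using `dict(zip(...))` on a two-row sample; `sample_header`, `sample_rows`, and `sample_dicts` are hypothetical names chosen so they do not clash with the variables above.

```python
# Sample header and rows copied from the dataset.
sample_header = ['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']
sample_rows = [
    ['1', '1', 'Izze', '[Clementine]', '$3.39 '],
    ['1', '1', 'Nantucket Nectar', '[Apple]', '$3.39 '],
]

# zip pairs each column name with the value in the same position,
# and dict turns those pairs into one dictionary per row.
sample_dicts = [dict(zip(sample_header, row)) for row in sample_rows]

print(sample_dicts[0]['item_name'])  # Izze
```

This version keeps working if a column is added to the header, whereas the indexed version would need another line.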

In [14]:
data_dicts[:4]

[{'order_id': '1',
  'quantity': '1',
  'item_name': 'Chips and Fresh Tomato Salsa',
  'choice_description': 'NULL',
  'item_price': '$2.39 '},
 {'order_id': '1',
  'quantity': '1',
  'item_name': 'Izze',
  'choice_description': '[Clementine]',
  'item_price': '$3.39 '},
 {'order_id': '1',
  'quantity': '1',
  'item_name': 'Nantucket Nectar',
  'choice_description': '[Apple]',
  'item_price': '$3.39 '},
 {'order_id': '1',
  'quantity': '1',
  'item_name': 'Chips and Tomatillo-Green Chili Salsa',
  'choice_description': 'NULL',
  'item_price': '$2.39 '}]

Step 2 - Clean up `data_dicts` so that the numeric fields are stored as numbers rather than strings.

In [15]:
# First convert the all-digit fields (order_id and quantity) to integers
for item in data_dicts:
    for h in header:
        if item[h].isdigit():
            item[h] = int(item[h])
        

In [None]:
# then clean up item_price:

In [16]:
for item in data_dicts:
    item['item_price'] = item['item_price'].lstrip('$')
    item['item_price'] = item['item_price'].rstrip(' ')
    item['item_price'] = float(item['item_price'])
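
The three cleaning steps can also be wrapped in one small helper. `parse_price` is a hypothetical name, and the sketch assumes every price looks like `'$2.39 '` (a leading dollar sign, optional surrounding whitespace, no thousands separators).

```python
def parse_price(raw):
    """Convert a price string such as '$2.39 ' to a float."""
    # strip() removes surrounding whitespace, lstrip('$') removes the dollar sign.
    return float(raw.strip().lstrip('$'))


print(parse_price('$2.39 '))   # 2.39
print(parse_price('$16.98 '))  # 16.98
```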

In [17]:
data_dicts[:3]

[{'order_id': 1,
  'quantity': 1,
  'item_name': 'Chips and Fresh Tomato Salsa',
  'choice_description': 'NULL',
  'item_price': 2.39},
 {'order_id': 1,
  'quantity': 1,
  'item_name': 'Izze',
  'choice_description': '[Clementine]',
  'item_price': 3.39},
 {'order_id': 1,
  'quantity': 1,
  'item_name': 'Nantucket Nectar',
  'choice_description': '[Apple]',
  'item_price': 3.39}]

In [18]:
[print(i) for i in data_dicts if i['order_id']==1]

{'order_id': 1, 'quantity': 1, 'item_name': 'Chips and Fresh Tomato Salsa', 'choice_description': 'NULL', 'item_price': 2.39}
{'order_id': 1, 'quantity': 1, 'item_name': 'Izze', 'choice_description': '[Clementine]', 'item_price': 3.39}
{'order_id': 1, 'quantity': 1, 'item_name': 'Nantucket Nectar', 'choice_description': '[Apple]', 'item_price': 3.39}
{'order_id': 1, 'quantity': 1, 'item_name': 'Chips and Tomatillo-Green Chili Salsa', 'choice_description': 'NULL', 'item_price': 2.39}


[None, None, None, None]

In [19]:
[i for i in data_dicts if i['quantity']>=2][:3]

[{'order_id': 2,
  'quantity': 2,
  'item_name': 'Chicken Bowl',
  'choice_description': '[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]]',
  'item_price': 16.98},
 {'order_id': 9,
  'quantity': 2,
  'item_name': 'Canned Soda',
  'choice_description': '[Sprite]',
  'item_price': 2.18},
 {'order_id': 23,
  'quantity': 2,
  'item_name': 'Canned Soda',
  'choice_description': '[Mountain Dew]',
  'item_price': 2.18}]

In [168]:
[i for i in data_dicts if (i['choice_description']=="[Sprite]" and i['quantity'] > 2)][:10]

# It looks like item_price already accounts for the quantity, e.g. $1.09 * 2 = $2.18

[{'order_id': 350,
  'quantity': 3,
  'item_name': 'Canned Soft Drink',
  'choice_description': '[Sprite]',
  'item_price': 3.75},
 {'order_id': 901,
  'quantity': 4,
  'item_name': 'Canned Soda',
  'choice_description': '[Sprite]',
  'item_price': 4.36},
 {'order_id': 1786,
  'quantity': 4,
  'item_name': 'Canned Soft Drink',
  'choice_description': '[Sprite]',
  'item_price': 5.0}]

In [20]:
[i for i in data_dicts if i['order_id'] == 1786]

# $1.25 is believable for a can of soft drink, and it's consistent across soft drinks. Maybe a different state?

[{'order_id': 1786,
  'quantity': 1,
  'item_name': 'Chicken Bowl',
  'choice_description': '[Fresh Tomato Salsa, Rice]',
  'item_price': 8.75},
 {'order_id': 1786,
  'quantity': 1,
  'item_name': 'Carnitas Burrito',
  'choice_description': '[Fresh Tomato Salsa, [Fajita Vegetables, Rice, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]',
  'item_price': 11.75},
 {'order_id': 1786,
  'quantity': 1,
  'item_name': 'Chicken Bowl',
  'choice_description': '[Fresh Tomato Salsa, [Rice, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]',
  'item_price': 11.25},
 {'order_id': 1786,
  'quantity': 1,
  'item_name': 'Chicken Bowl',
  'choice_description': '[Fresh Tomato Salsa, [Fajita Vegetables, Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]]',
  'item_price': 11.25},
 {'order_id': 1786,
  'quantity': 1,
  'item_name': 'Barbacoa Bowl',
  'choice_description': '[Fresh Tomato Salsa, [Fajita Vegetables, Rice, Black Beans, Guacamole, Lettuce]]',
  'item_price': 11.75},
 {'order_

In [57]:
sum([1,2,3])

6

Calculate the average order price (i.e. total cost of orders / number of distinct orders)

In [18]:
total_orders_cost = sum([i['item_price']  for i in data_dicts])

In [19]:
# Count distinct order IDs; max() would only be right if the IDs run from 1 to N with no gaps
total_orders = len({i['order_id'] for i in data_dicts})

In [20]:
print(f"Average cost per order: {format(total_orders_cost/total_orders,'.2f')}")

Average cost per order: 18.81
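
To make the "total cost / number of distinct orders" calculation concrete, here is a sketch that totals each order first and then averages the totals. The sample contains the item prices of orders 1 and 2 from the data; `sample_items` and `order_totals` are hypothetical names. Because `item_price` already reflects the quantity, `quantity` is not used.

```python
from collections import defaultdict

# Item rows for orders 1 and 2, reduced to the two fields that matter here.
sample_items = [
    {'order_id': 1, 'item_price': 2.39},
    {'order_id': 1, 'item_price': 3.39},
    {'order_id': 1, 'item_price': 3.39},
    {'order_id': 1, 'item_price': 2.39},
    {'order_id': 2, 'item_price': 16.98},
]

# Accumulate one running total per order_id.
order_totals = defaultdict(float)
for item in sample_items:
    order_totals[item['order_id']] += item['item_price']

# Average of the per-order totals.
average_order = sum(order_totals.values()) / len(order_totals)
print(f"Average cost per order: {average_order:.2f}")  # 14.27
```

Dividing the grand total by the number of distinct orders gives the same figure; the per-order dictionary is just a convenient by-product if the distribution of order values is needed later.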


### Part 4: Create a list (or set) named `unique_sodas` containing all of the unique sodas and soft drinks that Chipotle sells.

Note: Just look for `'Canned Soda'` and `'Canned Soft Drink'`, and ignore other drinks like `'Izze'`.

In [21]:
all_sodas = [i['choice_description'] for i in data_dicts if i['item_name'][:6] == 'Canned']

In [22]:
all_sodas_clean = []

for i in all_sodas:
    ci = i[1:-1]
    all_sodas_clean.append(ci)

In [23]:
unique_sodas = list(set(all_sodas_clean))

In [24]:
unique_sodas

['Dr. Pepper',
 'Lemonade',
 'Nestea',
 'Coke',
 'Coca Cola',
 'Sprite',
 'Mountain Dew',
 'Diet Dr. Pepper',
 'Diet Coke']

It may make sense to combine "Coke" and "Coca Cola" for specific work related to soft drinks.
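
One way to do that merge is an alias table applied while building the set. This is only a sketch: `SODA_ALIASES` is a hypothetical mapping, and treating "Coke" as the canonical name is an assumption rather than something the data dictates.

```python
# Hypothetical alias table: maps alternative spellings to one canonical name.
SODA_ALIASES = {'Coca Cola': 'Coke'}

sample_descriptions = ['[Coke]', '[Coca Cola]', '[Sprite]', '[Coke]']

unique_sample = set()
for description in sample_descriptions:
    name = description.strip('[]')                   # drop the surrounding brackets
    unique_sample.add(SODA_ALIASES.get(name, name))  # fall back to the name itself

print(sorted(unique_sample))  # ['Coke', 'Sprite']
```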

---

## Advanced Level


### Part 5: Calculate the average number of toppings per burrito.

Note: Let's ignore the `quantity` column to simplify this task.

Hint: Think carefully about the easiest way to count the number of toppings!


In [99]:
len(['1','2','3'])

3

In [105]:
[(i['choice_description']) for i in data_dicts if "Burrito" in i['item_name']]

['[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]',
 '[Fresh Tomato Salsa, [Rice, Black Beans, Pinto Beans, Cheese, Sour Cream, Lettuce]]',
 '[Tomatillo-Green Chili Salsa (Medium), [Pinto Beans, Cheese, Sour Cream]]',
 '[Fresh Tomato Salsa (Mild), [Black Beans, Rice, Cheese, Sour Cream, Lettuce]]',
 '[[Fresh Tomato Salsa (Mild), Tomatillo-Green Chili Salsa (Medium), Tomatillo-Red Chili Salsa (Hot)], [Rice, Cheese, Sour Cream, Lettuce]]',
 '[[Tomatillo-Green Chili Salsa (Medium), Tomatillo-Red Chili Salsa (Hot)], [Pinto Beans, Rice, Cheese, Sour Cream, Guacamole, Lettuce]]',
 '[[Tomatillo-Green Chili Salsa (Medium), Roasted Chili Corn Salsa (Medium)], [Black Beans, Rice, Sour Cream, Lettuce]]',
 '[Tomatillo-Green Chili Salsa (Medium), [Pinto Beans, Rice, Cheese, Sour Cream]]',
 '[[Roasted Chili Corn Salsa (Medium), Fresh Tomato Salsa (Mild)], [Rice, Black Beans, Sour Cream]]',
 '[Fresh Tomato Salsa, [Rice, Pinto Beans, C

In [110]:
print('[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]'
     .replace('[','').replace(']','')
     )

Tomatillo Red Chili Salsa, Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce


In [126]:
toppings_per_burrito = [i['choice_description'].replace('[','').replace(']','') for i in data_dicts if "Burrito" in i['item_name']]

In [127]:
list(toppings_per_burrito[0].replace(' ','').split(','))

['TomatilloRedChiliSalsa',
 'FajitaVegetables',
 'BlackBeans',
 'PintoBeans',
 'Cheese',
 'SourCream',
 'Guacamole',
 'Lettuce']

In [128]:
toppings_per_burrito_clean = [list(i.replace(' ','').split(',')) for i in toppings_per_burrito]

In [129]:
toppings_per_burrito_clean

[['TomatilloRedChiliSalsa',
  'FajitaVegetables',
  'BlackBeans',
  'PintoBeans',
  'Cheese',
  'SourCream',
  'Guacamole',
  'Lettuce'],
 ['FreshTomatoSalsa',
  'Rice',
  'BlackBeans',
  'PintoBeans',
  'Cheese',
  'SourCream',
  'Lettuce'],
 ['Tomatillo-GreenChiliSalsa(Medium)', 'PintoBeans', 'Cheese', 'SourCream'],
 ['FreshTomatoSalsa(Mild)',
  'BlackBeans',
  'Rice',
  'Cheese',
  'SourCream',
  'Lettuce'],
 ['FreshTomatoSalsa(Mild)',
  'Tomatillo-GreenChiliSalsa(Medium)',
  'Tomatillo-RedChiliSalsa(Hot)',
  'Rice',
  'Cheese',
  'SourCream',
  'Lettuce'],
 ['Tomatillo-GreenChiliSalsa(Medium)',
  'Tomatillo-RedChiliSalsa(Hot)',
  'PintoBeans',
  'Rice',
  'Cheese',
  'SourCream',
  'Guacamole',
  'Lettuce'],
 ['Tomatillo-GreenChiliSalsa(Medium)',
  'RoastedChiliCornSalsa(Medium)',
  'BlackBeans',
  'Rice',
  'SourCream',
  'Lettuce'],
 ['Tomatillo-GreenChiliSalsa(Medium)',
  'PintoBeans',
  'Rice',
  'Cheese',
  'SourCream'],
 ['RoastedChiliCornSalsa(Medium)',
  'FreshTomatoSalsa(M

In [130]:
avg_toppings_per_burrito = sum([len(i) for i in toppings_per_burrito_clean])/len(toppings_per_burrito_clean)

In [131]:
print(f"Average toppings per burrito: {format(avg_toppings_per_burrito,'f')}")

Average toppings per burrito: 5.395051



**Consolidated set of code to determine average:**



In [133]:
# clean up the dictionary output slightly
toppings_per_burrito = [i['choice_description'].replace('[','').replace(']','') for i in data_dicts if "Burrito" in i['item_name']]

# clean it up some more to get the toppings into a single nested list
toppings_per_burrito_clean = [list(i.replace(' ','').split(',')) for i in toppings_per_burrito]

# calculate the average toppings per burrito
avg_toppings_per_burrito = sum([len(i) for i in toppings_per_burrito_clean])/len(toppings_per_burrito_clean)

# print the output
print(f"Average toppings per burrito: {format(avg_toppings_per_burrito,'f')}")

Average toppings per burrito: 5.395051
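
A shorter route to the same count is to count the commas in the raw description, since each topping after the first adds exactly one comma. `count_toppings` is a hypothetical helper; it assumes the description is a real choice list, so a `'NULL'` value would be counted as one topping.

```python
def count_toppings(choice_description):
    """Return the number of comma-separated choices in a description string."""
    return choice_description.count(',') + 1


example = ('[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, '
           'Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]')
print(count_toppings(example))  # 8
```

The result matches the eight-element list produced by the bracket-stripping and splitting approach for the same burrito.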


### Part 6: Create a dictionary. Let the keys represent chip orders and the values represent the total number of orders.

Expected output: `{'Chips and Roasted Chili-Corn Salsa': 18, ... }`

Note: Please take the `quantity` column into account!

Optional: Learn how to use `collections.defaultdict` to simplify your code.

In [30]:
# Get a list of all rows where chips were ordered (case-insensitive match on the item name):
chips_orders = [i for i in data_dicts if 'chips' in i['item_name'].lower()]


In [29]:
"ABC".lower()

'abc'

In [31]:
len(chips_orders)

1084

In [32]:
chips_orders[:4]

[{'order_id': 1,
  'quantity': 1,
  'item_name': 'Chips and Fresh Tomato Salsa',
  'choice_description': 'NULL',
  'item_price': 2.39},
 {'order_id': 1,
  'quantity': 1,
  'item_name': 'Chips and Tomatillo-Green Chili Salsa',
  'choice_description': 'NULL',
  'item_price': 2.39},
 {'order_id': 3,
  'quantity': 1,
  'item_name': 'Side of Chips',
  'choice_description': 'NULL',
  'item_price': 1.69},
 {'order_id': 5,
  'quantity': 1,
  'item_name': 'Chips and Guacamole',
  'choice_description': 'NULL',
  'item_price': 4.45}]

In [33]:
[print(i) for i in chips_orders if "-" in i['item_name']]

{'order_id': 1, 'quantity': 1, 'item_name': 'Chips and Tomatillo-Green Chili Salsa', 'choice_description': 'NULL', 'item_price': 2.39}
{'order_id': 8, 'quantity': 1, 'item_name': 'Chips and Tomatillo-Green Chili Salsa', 'choice_description': 'NULL', 'item_price': 2.39}
{'order_id': 15, 'quantity': 1, 'item_name': 'Chips and Tomatillo-Green Chili Salsa', 'choice_description': 'NULL', 'item_price': 2.39}
{'order_id': 50, 'quantity': 1, 'item_name': 'Chips and Tomatillo-Green Chili Salsa', 'choice_description': 'NULL', 'item_price': 2.39}
{'order_id': 58, 'quantity': 1, 'item_name': 'Chips and Tomatillo-Green Chili Salsa', 'choice_description': 'NULL', 'item_price': 2.39}
{'order_id': 69, 'quantity': 1, 'item_name': 'Chips and Tomatillo-Green Chili Salsa', 'choice_description': 'NULL', 'item_price': 2.39}
{'order_id': 74, 'quantity': 1, 'item_name': 'Chips and Tomatillo-Green Chili Salsa', 'choice_description': 'NULL', 'item_price': 2.39}
{'order_id': 85, 'quantity': 1, 'item_name': 'Chip


In [34]:
# Compile the unique list of chips_orders item names.

# Some items appear under two spellings, with and without a hyphen
# (e.g. "Chips and Tomatillo-Green Chili Salsa"). Replacing "-" at this stage
# turned out to be more complicated than expected, so the names are left as-is
# here; see the extra workings below.
chips_item_names = list(set([i['item_name'] for i in chips_orders]))

In [35]:
chips_item_names

['Chips and Fresh Tomato Salsa',
 'Chips',
 'Side of Chips',
 'Chips and Mild Fresh Tomato Salsa',
 'Chips and Tomatillo-Green Chili Salsa',
 'Chips and Roasted Chili-Corn Salsa',
 'Chips and Roasted Chili Corn Salsa',
 'Chips and Tomatillo Green Chili Salsa',
 'Chips and Guacamole',
 'Chips and Tomatillo-Red Chili Salsa',
 'Chips and Tomatillo Red Chili Salsa']

In [36]:
chips_orders[:3]

[{'order_id': 1,
  'quantity': 1,
  'item_name': 'Chips and Fresh Tomato Salsa',
  'choice_description': 'NULL',
  'item_price': 2.39},
 {'order_id': 1,
  'quantity': 1,
  'item_name': 'Chips and Tomatillo-Green Chili Salsa',
  'choice_description': 'NULL',
  'item_price': 2.39},
 {'order_id': 3,
  'quantity': 1,
  'item_name': 'Side of Chips',
  'choice_description': 'NULL',
  'item_price': 1.69}]

In [37]:
[print(i) for i in chips_orders if (i['item_name'] == 'Chips' and i['choice_description'] != 'NULL')]

[]

In [38]:
[print(i) for i in chips_orders if (i['item_name'] == 'Side of Chips' and i['choice_description'] != 'NULL')]

[]

In [39]:
chip_dict = {}

for chip_name in chips_item_names:
    num_orders = sum([i['quantity'] for i in chips_orders if i['item_name']==chip_name])
    chip_dict.update({chip_name : num_orders})

In [40]:
chip_dict  # Hyphenated and unhyphenated spellings are counted separately here

{'Chips and Fresh Tomato Salsa': 130,
 'Chips': 230,
 'Side of Chips': 110,
 'Chips and Mild Fresh Tomato Salsa': 1,
 'Chips and Tomatillo-Green Chili Salsa': 33,
 'Chips and Roasted Chili-Corn Salsa': 18,
 'Chips and Roasted Chili Corn Salsa': 23,
 'Chips and Tomatillo Green Chili Salsa': 45,
 'Chips and Guacamole': 506,
 'Chips and Tomatillo-Red Chili Salsa': 25,
 'Chips and Tomatillo Red Chili Salsa': 50}

**Output of total types of chip orders below**

* This is the output **without** the "-" replacement applied.
* This highlights an error in the first pass: in the cell above, the total for Tomatillo Green Chili Salsa should be 78 (45 + 33), not 45.
* The cause is that I adjusted the names in the top-level list of item names, but not in the underlying data.

In [180]:
chip_dict 

{'Chips and Tomatillo Red Chili Salsa': 50,
 'Chips and Fresh Tomato Salsa': 130,
 'Chips and Tomatillo Green Chili Salsa': 45,
 'Chips and Tomatillo-Green Chili Salsa': 33,
 'Chips': 230,
 'Chips and Roasted Chili-Corn Salsa': 18,
 'Side of Chips': 110,
 'Chips and Guacamole': 506,
 'Chips and Roasted Chili Corn Salsa': 23,
 'Chips and Tomatillo-Red Chili Salsa': 25,
 'Chips and Mild Fresh Tomato Salsa': 1}

In [53]:
chips_orders[0]

{'order_id': 1,
 'quantity': 1,
 'item_name': 'Chips and Fresh Tomato Salsa',
 'choice_description': 'NULL',
 'item_price': 2.39}

In [43]:
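# Take a copy of chips_orders to clean up.
# Note: this comprehension makes a shallow copy. The list is new, but each dict
# in it is the same object as in chips_orders (and data_dicts).
chips_orders_clean = [i for i in chips_orders]

# chips_orders_clean = chips_orders
# would not copy anything; it would just be a second name for the same list.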


In [44]:
chips_orders_clean is chips_orders

False
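
The `is` check only shows that the outer list is a new object. The dictionaries inside it are still shared, so editing them through the copy also edits the original rows. The sketch below illustrates the difference on a one-row sample and shows `copy.deepcopy` as one way to get fully independent rows; `orders`, `shallow`, and `deep` are hypothetical names.

```python
import copy

orders = [{'item_name': 'Chips and Tomatillo-Green Chili Salsa'}]

shallow = [o for o in orders]   # new list, same dict objects inside
deep = copy.deepcopy(orders)    # new list and new dict objects

# Edit the row through the shallow copy.
shallow[0]['item_name'] = shallow[0]['item_name'].replace('-', ' ')

print(shallow is orders)        # False: the lists differ
print(shallow[0] is orders[0])  # True: the dicts are shared
print(orders[0]['item_name'])   # Chips and Tomatillo Green Chili Salsa
print(deep[0]['item_name'])     # Chips and Tomatillo-Green Chili Salsa
```

Whether sharing matters depends on the goal: if the original rows should keep their hyphenated names, a deep copy (or `[dict(o) for o in orders]` for flat dictionaries) avoids the side effect.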

In [48]:
print(chips_orders_clean[1])
print(chips_orders_clean[8])

{'order_id': 1, 'quantity': 1, 'item_name': 'Chips and Tomatillo Green Chili Salsa', 'choice_description': 'NULL', 'item_price': 2.39}
{'order_id': 15, 'quantity': 1, 'item_name': 'Chips and Tomatillo Green Chili Salsa', 'choice_description': 'NULL', 'item_price': 2.39}


In [45]:
for i in chips_orders_clean:
    i['item_name'] = i['item_name'].replace('-',' ')

In [53]:
chip_dict_2 = {}

for chip_name in chips_item_names:
    num_orders = sum([i['quantity'] for i in chips_orders_clean if i['item_name']==chip_name])
    if num_orders > 0:    # This makes sure the name appears at all
        chip_dict_2.update({chip_name : num_orders})

In [54]:
chip_dict_2

{'Chips and Fresh Tomato Salsa': 130,
 'Chips': 230,
 'Side of Chips': 110,
 'Chips and Mild Fresh Tomato Salsa': 1,
 'Chips and Roasted Chili Corn Salsa': 41,
 'Chips and Tomatillo Green Chili Salsa': 78,
 'Chips and Guacamole': 506,
 'Chips and Tomatillo Red Chili Salsa': 75}
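
For the optional `defaultdict` route, a minimal sketch is shown below. It normalises the hyphen while counting, so the item names never need to be compiled separately. `sample_chip_rows` is a hypothetical three-row sample, and its quantities are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows; only the two fields used below are included.
sample_chip_rows = [
    {'item_name': 'Chips and Tomatillo-Green Chili Salsa', 'quantity': 1},
    {'item_name': 'Chips and Tomatillo Green Chili Salsa', 'quantity': 2},
    {'item_name': 'Side of Chips', 'quantity': 1},
]

# defaultdict(int) starts every missing key at 0, so += works on first sight.
chip_counts = defaultdict(int)
for row in sample_chip_rows:
    name = row['item_name'].replace('-', ' ')  # merge the two spellings
    chip_counts[name] += row['quantity']

print(dict(chip_counts))
# {'Chips and Tomatillo Green Chili Salsa': 3, 'Side of Chips': 1}
```

Because the name is normalised on a local variable, the original rows are left untouched.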

---

## Bonus: Craft a problem statement about this data that interests you, and then answer it!


**Actioned options:**

* Look at number of toppings with different food types
* Look at the distribution of items in orders

**Potential options:**
* Look at the distribution of item values, across different items
* Try and infer the price of different additions / extras (see if there is any?)

**Answer some general questions:**
1. Breakdown of orders with 0, 1, 2, ... drinks?
2. What proportion of orders include 0, 1, or 2 chips as a side?
3. (More interesting) A grid of the mix of orders (i.e. x-axis drinks / y-axis chips).


In [57]:
data_dicts[:4]

[{'order_id': 1,
  'quantity': 1,
  'item_name': 'Chips and Fresh Tomato Salsa',
  'choice_description': 'NULL',
  'item_price': 2.39},
 {'order_id': 1,
  'quantity': 1,
  'item_name': 'Izze',
  'choice_description': '[Clementine]',
  'item_price': 3.39},
 {'order_id': 1,
  'quantity': 1,
  'item_name': 'Nantucket Nectar',
  'choice_description': '[Apple]',
  'item_price': 3.39},
 {'order_id': 1,
  'quantity': 1,
  'item_name': 'Chips and Tomatillo Green Chili Salsa',
  'choice_description': 'NULL',
  'item_price': 2.39}]

## Use numpy and pandas to make life easier!

In [59]:
import numpy as np
import pandas as pd

In [62]:
df = pd.read_csv(DATA_FILE,sep='\t')

In [63]:
df.head()

| | order_id | quantity | item_name | choice_description | item_price |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | | $2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | | $2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |


In [65]:
df.shape

(4622, 5)

In [96]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   order_id            4622 non-null   int64  
 1   quantity            4622 non-null   int64  
 2   item_name           4622 non-null   object 
 3   choice_description  3376 non-null   object 
 4   item_price          4622 non-null   object 
 5   item_price_value    4622 non-null   float64
 6   description_clean   4622 non-null   object 
dtypes: float64(1), int64(2), object(4)
memory usage: 252.9+ KB


In [90]:
df['item_price_value'] = df['item_price'].apply(lambda x: float(x.lstrip("$")))

In [91]:
df['item_price_value'][:5]

0     2.39
1     3.39
2     3.39
3     2.39
4    16.98
Name: item_price_value, dtype: float64

## Look at toppings mix

In [105]:
?df.notnull

In [106]:
df['choice_description'][df['choice_description'].notnull()]

1                                            [Clementine]
2                                                 [Apple]
4       [Tomatillo-Red Chili Salsa (Hot), [Black Beans...
5       [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...
7       [Tomatillo Red Chili Salsa, [Fajita Vegetables...
                              ...                        
4617    [Fresh Tomato Salsa, [Rice, Black Beans, Sour ...
4618    [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...
4619    [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...
4620    [Fresh Tomato Salsa, [Fajita Vegetables, Lettu...
4621    [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...
Name: choice_description, Length: 3376, dtype: object

In [107]:
df['description_clean'] = df['choice_description'][df['choice_description'].notnull()].apply(lambda x: (str(x).replace("[",'')).replace("]",''))

In [114]:
print(df['description_clean'][4])

Tomatillo-Red Chili Salsa (Hot), Black Beans, Rice, Cheese, Sour Cream


In [120]:
df['number_of_choices'] = df['choice_description'][df['choice_description'].notnull()].apply(lambda x: 1 + x.count(","))

In [131]:
?pd.Series.groupby

In [132]:
# Look at items with > 1 choice typically associated with them

(df['item_name'][df['number_of_choices']>1])

4             Chicken Bowl
5             Chicken Bowl
7            Steak Burrito
8         Steak Soft Tacos
9            Steak Burrito
               ...        
4617         Steak Burrito
4618         Steak Burrito
4619    Chicken Salad Bowl
4620    Chicken Salad Bowl
4621    Chicken Salad Bowl
Name: item_name, Length: 2827, dtype: object

In [137]:
(df['item_name'][df['number_of_choices']>1]).value_counts()

Chicken Bowl             720
Chicken Burrito          548
Steak Burrito            363
Steak Bowl               209
Chicken Soft Tacos       114
Chicken Salad Bowl       108
Veggie Burrito            94
Barbacoa Burrito          88
Veggie Bowl               85
Carnitas Bowl             68
Barbacoa Bowl             61
Carnitas Burrito          58
Steak Soft Tacos          53
Chicken Crispy Tacos      46
Carnitas Soft Tacos       37
Steak Crispy Tacos        34
Steak Salad Bowl          29
Barbacoa Soft Tacos       25
Veggie Salad Bowl         18
Barbacoa Crispy Tacos     11
Chicken Salad              9
Barbacoa Salad Bowl        8
Veggie Soft Tacos          7
Carnitas Crispy Tacos      6
Carnitas Salad Bowl        6
Burrito                    6
Veggie Salad               6
Steak Salad                4
Bowl                       2
Salad                      2
Veggie Crispy Tacos        1
Carnitas Salad             1
Name: item_name, dtype: int64

In [150]:
(if 'Chicken' in df['item_name']: True).value_counts()

SyntaxError: invalid syntax (<ipython-input-150-112ac15bc574>, line 1)

In [151]:
df['item_name'].str.contains('Chicken')

0       False
1       False
2       False
3       False
4        True
        ...  
4617    False
4618    False
4619     True
4620     True
4621     True
Name: item_name, Length: 4622, dtype: bool

In [161]:
# Create a column calling out Chicken / Beef / Pork / Veggie or Other

conditions = [
    df['item_name'].str.contains('Chicken'),
    df['item_name'].str.contains('Steak'),  
    df['item_name'].str.contains('Barbacoa'),
    df['item_name'].str.contains('Carnitas'),
    df['item_name'].str.contains('Veggie')  
]


results = [
    'Chicken',
    'Beef',
    'Beef',
    'Pork',
    'Veggie'    
]
 
df['protein'] = np.select(conditions, results, 'Other')   

In [170]:
?df.drop

In [171]:
df.drop('Protein',axis=1,inplace=True)

In [172]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   order_id            4622 non-null   int64  
 1   quantity            4622 non-null   int64  
 2   item_name           4622 non-null   object 
 3   choice_description  3376 non-null   object 
 4   item_price          4622 non-null   object 
 5   item_price_value    4622 non-null   float64
 6   description_clean   3376 non-null   object 
 7   number_of_choices   3376 non-null   float64
 8   protein             4622 non-null   object 
dtypes: float64(2), int64(2), object(5)
memory usage: 325.1+ KB


In [163]:
df[['item_name','protein']][:20]

| | item_name | protein |
|---|---|---|
| 0 | Chips and Fresh Tomato Salsa | Other |
| 1 | Izze | Other |
| 2 | Nantucket Nectar | Other |
| 3 | Chips and Tomatillo-Green Chili Salsa | Other |
| 4 | Chicken Bowl | Chicken |
| 5 | Chicken Bowl | Chicken |
| 6 | Side of Chips | Other |
| 7 | Steak Burrito | Beef |
| 8 | Steak Soft Tacos | Beef |
| 9 | Steak Burrito | Beef |


In [176]:
df['protein'].value_counts()

Other      1764
Chicken    1560
Beef        905
Veggie      212
Pork        181
Name: protein, dtype: int64

In [180]:
?pd.Series.value_counts

In [182]:
df['protein'][(df.protein != 'Other')].value_counts(normalize = True)

Chicken    0.545836
Beef       0.316655
Veggie     0.074178
Pork       0.063331
Name: protein, dtype: float64

In [183]:
# Create a column calling out Bowl / Burrito / Tacos or Other

conditions = [
    df['item_name'].str.contains('Bowl'),
    df['item_name'].str.contains('Burrito'),  
    df['item_name'].str.contains('Tacos'), 
]


results = [
    'Bowl',
    'Burrito',
    'Tacos'   
]
 
df['form_factor'] = np.select(conditions, results, 'Other')   

In [184]:
df['form_factor'].value_counts()

Other      1774
Bowl       1331
Burrito    1172
Tacos       345
Name: form_factor, dtype: int64

In [185]:
df['form_factor'][(df.protein != 'Other')].value_counts(normalize = True)

Bowl       0.465010
Burrito    0.407978
Tacos      0.120014
Other      0.006998
Name: form_factor, dtype: float64

In [186]:
df.head(10)

| | order_id | quantity | item_name | choice_description | item_price | item_price_value | description_clean | number_of_choices | protein | form_factor |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | | $2.39 | 2.39 | | | Other | Other |
| 1 | 1 | 1 | Izze | [Clementine] | $3.39 | 3.39 | Clementine | 1.0 | Other | Other |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 | 3.39 | Apple | 1.0 | Other | Other |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | | $2.39 | 2.39 | | | Other | Other |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 | 16.98 | Tomatillo-Red Chili Salsa (Hot), Black Beans, ... | 5.0 | Chicken | Bowl |
| 5 | 3 | 1 | Chicken Bowl | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $10.98 | 10.98 | Fresh Tomato Salsa (Mild), Rice, Cheese, Sour ... | 6.0 | Chicken | Bowl |
| 6 | 3 | 1 | Side of Chips | | $1.69 | 1.69 | | | Other | Other |
| 7 | 4 | 1 | Steak Burrito | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | $11.75 | 11.75 | Tomatillo Red Chili Salsa, Fajita Vegetables, ... | 8.0 | Beef | Burrito |
| 8 | 4 | 1 | Steak Soft Tacos | [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | $9.25 | 9.25 | Tomatillo Green Chili Salsa, Pinto Beans, Chee... | 5.0 | Beef | Tacos |
| 9 | 5 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | $9.25 | 9.25 | Fresh Tomato Salsa, Rice, Black Beans, Pinto B... | 7.0 | Beef | Burrito |


In [194]:
# Average number of choices per form factor

df['number_of_choices'][(df['form_factor'] != 'Other')].groupby(df['form_factor']).mean()


form_factor
Bowl       5.503381
Burrito    5.395051
Tacos      4.318841
Name: number_of_choices, dtype: float64

In [198]:
# Average number of choices per protein and form factor

df['number_of_choices'][(df['form_factor'] != 'Other')].groupby([df['protein'],df['form_factor']]).mean().reset_index()


| | protein | form_factor | number_of_choices |
|---|---|---|---|
| 0 | Beef | Bowl | 5.503165 |
| 1 | Beef | Burrito | 5.35512 |
| 2 | Beef | Tacos | 4.373016 |
| 3 | Chicken | Bowl | 5.370813 |
| 4 | Chicken | Burrito | 5.329114 |
| 5 | Chicken | Tacos | 4.216049 |
| 6 | Other | Bowl | 5.5 |
| 7 | Other | Burrito | 5.833333 |
| 8 | Other | Tacos | 1.0 |
| 9 | Pork | Bowl | 6.108108 |


In [199]:
df['number_of_choices'][(df['form_factor'] != 'Other')].groupby([df['protein'],df['form_factor']]).count().reset_index()


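| | protein | form_factor | number_of_choices |
|---|---|---|---|
| 0 | Beef | Bowl | 316 |
| 1 | Beef | Burrito | 459 |
| 2 | Beef | Tacos | 126 |
| 3 | Chicken | Bowl | 836 |
| 4 | Chicken | Burrito | 553 |
| 5 | Chicken | Tacos | 162 |
| 6 | Other | Bowl | 2 |
| 7 | Other | Burrito | 6 |
| 8 | Other | Tacos | 2 |
| 9 | Pork | Bowl | 74 |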


In [203]:
avg = df['number_of_choices'][(df['form_factor'] != 'Other')].groupby([df['protein'],df['form_factor']]).mean().reset_index()
count = df['number_of_choices'][(df['form_factor'] != 'Other')].groupby([df['protein'],df['form_factor']]).count().reset_index()

df_out = avg.merge(count, how='inner', on = ['protein','form_factor'])
df_out

| | protein | form_factor | number_of_choices_x | number_of_choices_y |
|---|---|---|---|---|
| 0 | Beef | Bowl | 5.503165 | 316 |
| 1 | Beef | Burrito | 5.35512 | 459 |
| 2 | Beef | Tacos | 4.373016 | 126 |
| 3 | Chicken | Bowl | 5.370813 | 836 |
| 4 | Chicken | Burrito | 5.329114 | 553 |
| 5 | Chicken | Tacos | 4.216049 | 162 |
| 6 | Other | Bowl | 5.5 | 2 |
| 7 | Other | Burrito | 5.833333 | 6 |
| 8 | Other | Tacos | 1.0 | 2 |
| 9 | Pork | Bowl | 6.108108 | 74 |


In [232]:
?pd.Series.rename_axis

In [226]:
df_out.number_of_choices_x.rename('average',inplace=True)

0     5.503165
1     5.355120
2     4.373016
3     5.370813
4     5.329114
5     4.216049
6     5.500000
7     5.833333
8     1.000000
9     6.108108
10    5.372881
11    4.382979
12    6.145631
13    5.957895
14    6.000000
Name: average, dtype: float64

In [246]:
df_out.rename(columns={'number_of_choices_x':'average','number_of_choices_y':'count'},inplace=True)
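
The mean/count merge followed by a rename can also be written as a single grouped aggregation. The sketch below uses pandas named aggregation on a hypothetical three-row frame (`toy`); the column names mirror the ones above, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as df.
toy = pd.DataFrame({
    'protein': ['Beef', 'Beef', 'Chicken'],
    'form_factor': ['Bowl', 'Bowl', 'Tacos'],
    'number_of_choices': [5.0, 6.0, 4.0],
})

# Named aggregation: each keyword becomes an output column name,
# and its value is the aggregation applied to number_of_choices.
summary = (
    toy.groupby(['protein', 'form_factor'])['number_of_choices']
       .agg(average='mean', count='count')
       .reset_index()
)
print(summary)
```

This produces the `average` and `count` columns directly, so the `_x` / `_y` suffixes from the merge never appear.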

In [247]:
df_out

Unnamed: 0,protein,form_factor,average,count
0,Beef,Bowl,5.503165,316
1,Beef,Burrito,5.35512,459
2,Beef,Tacos,4.373016,126
3,Chicken,Bowl,5.370813,836
4,Chicken,Burrito,5.329114,553
5,Chicken,Tacos,4.216049,162
6,Other,Bowl,5.5,2
7,Other,Burrito,5.833333,6
8,Other,Tacos,1.0,2
9,Pork,Bowl,6.108108,74
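
The mean and count that were computed separately, merged, and renamed above can also be produced in one pass with pandas named aggregation. The sketch below is one possible alternative, not the approach used in this notebook. It assumes pandas 0.25 or later, and `toy` with its rows is made-up stand-in data for `df`.

```python
import pandas as pd

# Toy stand-in for df; these rows are invented for illustration only.
toy = pd.DataFrame({
    'protein': ['Beef', 'Beef', 'Chicken', 'Chicken', 'Chicken'],
    'form_factor': ['Bowl', 'Bowl', 'Bowl', 'Tacos', 'Tacos'],
    'number_of_choices': [5, 6, 4, 3, 5],
})

# Named aggregation: each keyword becomes an output column name,
# so no merge or rename step is needed afterwards.
summary = (
    toy.groupby(['protein', 'form_factor'])['number_of_choices']
       .agg(average='mean', count='count')
       .reset_index()
)
print(summary)
```

Because both statistics come from the same grouped object, the `_x` and `_y` suffixes never appear.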


## Look at the distribution of order values and order sizes

In [249]:
# Create a view of the orders / items in orders

df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price,item_price_value,description_clean,number_of_choices,protein,form_factor
0,1,1,Chips and Fresh Tomato Salsa,,$2.39,2.39,,,Other,Other
1,1,1,Izze,[Clementine],$3.39,3.39,Clementine,1.0,Other,Other
2,1,1,Nantucket Nectar,[Apple],$3.39,3.39,Apple,1.0,Other,Other
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39,2.39,,,Other,Other
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98,16.98,"Tomatillo-Red Chili Salsa (Hot), Black Beans, ...",5.0,Chicken,Bowl


In [267]:
df[['order_id','quantity','item_price_value']]

Unnamed: 0,order_id,quantity,item_price_value
0,1,1,2.39
1,1,1,3.39
2,1,1,3.39
3,1,1,2.39
4,2,2,16.98
...,...,...,...
4617,1833,1,11.75
4618,1833,1,11.75
4619,1834,1,11.25
4620,1834,1,8.75


In [272]:
?pd.Series.sum

In [292]:
df_orders = df[['quantity','item_price_value']].groupby(df['order_id']).sum().reset_index()

In [293]:
df_orders['avg_price_per_item'] = df_orders['item_price_value'] / df_orders['quantity']

In [309]:
?df.rename

In [294]:
df_orders.rename(columns={'quantity':'items_per_order','item_price_value':'order_value'},inplace=True)

In [295]:
df_orders

Unnamed: 0,order_id,items_per_order,order_value,avg_price_per_item
0,1,4,11.56,2.890000
1,2,2,16.98,8.490000
2,3,2,12.67,6.335000
3,4,2,21.00,10.500000
4,5,2,13.70,6.850000
...,...,...,...,...
1829,1830,2,23.00,11.500000
1830,1831,3,12.90,4.300000
1831,1832,2,13.20,6.600000
1832,1833,2,23.50,11.750000


In [296]:
df_orders['order_value'].describe()

count    1834.000000
mean       18.811429
std        11.652512
min        10.080000
25%        12.572500
50%        16.200000
75%        21.960000
max       205.250000
Name: order_value, dtype: float64

In [297]:
df_orders['items_per_order'].describe()

count    1834.000000
mean        2.711014
std         1.677624
min         1.000000
25%         2.000000
50%         2.000000
75%         3.000000
max        35.000000
Name: items_per_order, dtype: float64

In [301]:
df_orders['items_per_order'].quantile(.9)

4.0

In [302]:
df_orders['items_per_order'].quantile(.95)

5.0

In [303]:
df_orders['items_per_order'].quantile(.99)

8.0
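
`Series.quantile` also accepts a list of quantiles, so the three lookups above can be collapsed into a single call. This is a minimal sketch; `items` is a made-up Series standing in for `df_orders['items_per_order']`.

```python
import pandas as pd

# Invented order sizes, used only to make the example runnable.
items = pd.Series([1, 2, 2, 2, 3, 3, 4, 5, 8, 35], name='items_per_order')

# Passing a list returns a Series indexed by the requested quantiles.
q = items.quantile([0.9, 0.95, 0.99])
print(q)
```

The result is indexed by quantile, which makes the upper tail easier to compare at a glance.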

In [327]:
conditions = [
    df_orders['items_per_order'] == 1,
    df_orders['items_per_order'] == 2,
    df_orders['items_per_order'] == 3,
    df_orders['items_per_order'] == 4,
    df_orders['items_per_order'] == 5,
    (df_orders['items_per_order'] >= 6) & (df_orders['items_per_order'] <= 10),
    df_orders['items_per_order'] > 10
]

results = [
    'a. 1',
    'b. 2',
    'c. 3',
    'd. 4',
    'e. 5',
    'f. 6-10',
    'g. 10+'
]

df_orders['items_per_order_cat'] = np.select(conditions,results,'Other')

In [332]:
df_orders.head(5)

Unnamed: 0,order_id,items_per_order,order_value,avg_price_per_item,items_per_order_cat
0,1,4,11.56,2.89,d. 4
1,2,2,16.98,8.49,b. 2
2,3,2,12.67,6.335,b. 2
3,4,2,21.0,10.5,b. 2
4,5,2,13.7,6.85,b. 2
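
As a possible alternative to the `np.select` condition list, `pd.cut` can assign the same labels from a list of bin edges. This is a sketch under assumptions: `toy_orders` is a hypothetical stand-in for `df_orders`, and the bin edges were chosen to mirror the categories above.

```python
import pandas as pd

# Invented order sizes covering every category boundary.
toy_orders = pd.DataFrame({'items_per_order': [1, 2, 3, 4, 5, 6, 10, 11, 35]})

# With the default right=True, each bin includes its right edge:
# (0, 1], (1, 2], (2, 3], (3, 4], (4, 5], (5, 10], (10, inf]
bins = [0, 1, 2, 3, 4, 5, 10, float('inf')]
labels = ['a. 1', 'b. 2', 'c. 3', 'd. 4', 'e. 5', 'f. 6-10', 'g. 10+']

toy_orders['items_per_order_cat'] = pd.cut(
    toy_orders['items_per_order'], bins=bins, labels=labels
)
print(toy_orders)
```

`pd.cut` returns an ordered categorical, so the categories keep their a-to-g order without relying on the letter prefixes.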


In [328]:
df_orders['items_per_order_cat'].value_counts()

b. 2       1038
c. 3        455
d. 4        179
a. 1         56
e. 5         50
f. 6-10      44
g. 10+       12
Name: items_per_order_cat, dtype: int64

In [331]:
(df_orders['items_per_order_cat'].value_counts().reset_index()).sort_values('index')

Unnamed: 0,index,items_per_order_cat
3,a. 1,56
0,b. 2,1038
1,c. 3,455
2,d. 4,179
4,e. 5,50
5,f. 6-10,44
6,g. 10+,12


In [329]:
df_orders['items_per_order_cat'].value_counts(normalize=True)

b. 2       0.565976
c. 3       0.248092
d. 4       0.097601
a. 1       0.030534
e. 5       0.027263
f. 6-10    0.023991
g. 10+     0.006543
Name: items_per_order_cat, dtype: float64

In [317]:
?pd.Series.value_counts

In [330]:
(df_orders['items_per_order_cat'].value_counts(normalize = True).reset_index()).sort_values('index')

Unnamed: 0,index,items_per_order_cat
3,a. 1,0.030534
0,b. 2,0.565976
1,c. 3,0.248092
2,d. 4,0.097601
4,e. 5,0.027263
5,f. 6-10,0.023991
6,g. 10+,0.006543
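
The category-ordered view can also be obtained with `sort_index()` directly on the `value_counts()` result. In pandas 2.0 and later, `reset_index()` on that result no longer produces a column named `index`, so sorting by `'index'` may raise a `KeyError` there. A minimal sketch, using a made-up `cats` Series in place of `df_orders['items_per_order_cat']`:

```python
import pandas as pd

# Invented category labels, for illustration only.
cats = pd.Series(
    ['b. 2', 'a. 1', 'b. 2', 'c. 3', 'b. 2', 'c. 3'],
    name='items_per_order_cat',
)

# value_counts sorts by frequency; sort_index restores label order.
shares = cats.value_counts(normalize=True).sort_index()
print(shares)
```

Sorting the index does not depend on the column names that `reset_index()` generates, so the same line should behave the same way across pandas versions.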
