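### Below is an example that uses the pandas `str` accessor. Working through it, you will build a simple recipe lookup.
### The recipes are in English, but getting to know food from other countries is no bad thing;
### one day you may want to follow one of these recipes and cook a foreign dish yourself!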

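- In this example you will see how to:
	- read JSON data with pandas
	- find the recipe with the longest ingredient list
	- get comfortable with the str.contains method
	- build a table of ingredient indicators
	- use that table to find every recipe that contains the ingredients you want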

In [1]:
import numpy as np
import pandas as pd

#### Read JSON data with pandas

In [2]:
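from io import StringIO
with open('20170107-061401-recipeitems.json','r') as f:
    rec_li = [ rec.strip() for rec in f ] # strip leading/trailing whitespace and the newline from each line
    data_json = "[{}]".format(",".join(rec_li)) # join the records into a single JSON array string
recipes = pd.read_json(StringIO(data_json)) # wrap in StringIO; newer pandas deprecates passing a raw JSON string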
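
If the file really stores one JSON object per line, as the line-by-line joining suggests, pandas can also parse that layout directly with `lines=True`. Here is a minimal sketch on a tiny in-memory sample. The two records and the names `sample` and `sample_df` are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Two made-up records, one JSON object per line (hypothetical sample data)
sample = (
    '{"name": "Toast", "ingredients": "bread\\nbutter"}\n'
    '{"name": "Tea", "ingredients": "water\\ntea leaves"}\n'
)

# lines=True tells read_json to treat each line as a separate JSON object
sample_df = pd.read_json(io.StringIO(sample), lines=True)
print(sample_df.shape)
```

Assuming every line of the recipe file is a valid JSON object, `pd.read_json('20170107-061401-recipeitems.json', lines=True)` should give the same table without the manual joining step.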

In [3]:
recipes.head(3)

Unnamed: 0,_id,name,ingredients,url,image,ts,cookTime,source,recipeYield,datePublished,prepTime,description,totalTime,creator,recipeCategory,dateModified,recipeInstructions
0,{'$oid': '5160756b96cc62079cc2db15'},Drop Biscuits and Sausage Gravy,Biscuits\n3 cups All-purpose Flour\n2 Tablespo...,http://thepioneerwoman.com/cooking/2013/03/dro...,http://static.thepioneerwoman.com/cooking/file...,{'$date': 1365276011104},PT30M,thepioneerwoman,12.0,2013-03-11,PT10M,"Late Saturday afternoon, after Marlboro Man ha...",,,,,
1,{'$oid': '5160756d96cc62079cc2db16'},Hot Roast Beef Sandwiches,12 whole Dinner Rolls Or Small Sandwich Buns (...,http://thepioneerwoman.com/cooking/2013/03/hot...,http://static.thepioneerwoman.com/cooking/file...,{'$date': 1365276013902},PT20M,thepioneerwoman,12.0,2013-03-13,PT20M,"When I was growing up, I participated in my Ep...",,,,,
2,{'$oid': '5160756f96cc6207a37ff777'},Morrocan Carrot and Chickpea Salad,Dressing:\n1 tablespoon cumin seeds\n1/3 cup /...,http://www.101cookbooks.com/archives/moroccan-...,http://www.101cookbooks.com/mt-static/images/f...,{'$date': 1365276015332},,101cookbooks,,2013-01-07,PT15M,A beauty of a carrot salad - tricked out with ...,,,,,


In [4]:
recipes.shape

(173278, 17)

In [5]:
# Look at what a single recipe row contains
recipes.iloc[1]

_id                                {'$oid': '5160756d96cc62079cc2db16'}
name                                          Hot Roast Beef Sandwiches
ingredients           12 whole Dinner Rolls Or Small Sandwich Buns (...
url                   http://thepioneerwoman.com/cooking/2013/03/hot...
image                 http://static.thepioneerwoman.com/cooking/file...
ts                                             {'$date': 1365276013902}
cookTime                                                          PT20M
source                                                  thepioneerwoman
recipeYield                                                          12
datePublished                                                2013-03-13
prepTime                                                          PT20M
description           When I was growing up, I participated in my Ep...
totalTime                                                           NaN
creator                                                         

In [6]:
# Summary statistics for the character length of the ingredients column
recipes['ingredients'].str.len().describe()

count    173278.000000
mean        244.617926
std         146.705285
min           0.000000
25%         147.000000
50%         221.000000
75%         314.000000
max        9067.000000
Name: ingredients, dtype: float64

#### Find the name of the recipe with the longest ingredient list

In [7]:
recipes['name'][recipes['ingredients'].str.len().idxmax()]

'Carrot Pineapple Spice &amp; Brownie Layer Cake with Whipped Cream &amp; Cream Cheese Frosting and Marzipan Carrots'
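
`idxmax` returns the index label of the maximum, whereas NumPy's `argmax` returns its integer position. The two coincide only for a default integer index such as the one `recipes` has. A small sketch on made-up values (the Series `lengths` is hypothetical):

```python
import pandas as pd

# Toy Series with a non-default index (made-up values)
lengths = pd.Series([3, 9, 5], index=['a', 'b', 'c'])

label = lengths.idxmax()                # index label of the maximum
position = lengths.to_numpy().argmax()  # integer position of the maximum

print(label, position)
```

Use the label with `[]` or `.loc`, and the position with `.iloc`.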

#### Use str.contains to find matching recipes and aggregate the results

In [8]:
recipes['description'].str.contains(r'[bB]reakfast').sum()

3524

In [9]:
recipes['ingredients'].str.contains(r'[Cc]innamon').sum()

10526

In [10]:
# Count the recipes that misspell the ingredient as "cinamon"
recipes['ingredients'].str.contains(r'[Cc]inamon').sum()

11
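
A character class such as `[Cc]` only covers the case of the first letter. As a sketch on made-up strings (not dataset rows), `case=False` ignores case throughout, and `na=False` treats missing values as non-matches so the result can be summed safely:

```python
import numpy as np
import pandas as pd

# Made-up ingredient strings, including a missing value
s = pd.Series(['1 tsp Cinnamon', 'ground cinnamon', 'CINNAMON stick', 'nutmeg', np.nan])

# The character class handles only the first letter's case
first_letter = s.str.contains(r'[Cc]innamon', na=False)

# case=False ignores case everywhere; na=False counts missing values as non-matches
any_case = s.str.contains('cinnamon', case=False, na=False)

print(first_letter.sum(), any_case.sum())
```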

#### Build a table of ingredient indicators

In [11]:
# Build a list of the ingredients you are interested in
spice_li = ['salt','pepper','oregano','sage','parsley','rosemary','tarragon','thyme','paprika','cumin']

In [12]:
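import re
# Note: the second positional argument of str.contains is `case`, not `flags`,
# so this call is in fact case-sensitive. Passing flags=re.IGNORECASE (or
# case=False) would ignore case and would change the results shown below.
spice_dict = dict((spice,recipes['ingredients'].str.contains(spice,re.IGNORECASE)) for spice in spice_li)
spice_df = pd.DataFrame(spice_dict)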

In [13]:
spice_df.head()

Unnamed: 0,salt,pepper,oregano,sage,parsley,rosemary,tarragon,thyme,paprika,cumin
0,False,False,False,True,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,True,True,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
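
The table holds one boolean column per spice. A minimal sketch of the same idea on made-up data follows; it uses a dict comprehension and `case=False` for case-insensitive matching. The names `ingredients`, `spices` and `flags_df` are hypothetical, and this is an alternative spelling of the intent, not the original cell.

```python
import pandas as pd

# Made-up ingredient strings (hypothetical sample, not from the dataset)
ingredients = pd.Series([
    'Salt and black pepper',
    '2 tsp paprika, 1 tsp cumin',
    'fresh parsley, salt',
])
spices = ['salt', 'pepper', 'paprika', 'cumin', 'parsley']

# One boolean column per spice; case=False makes the match case-insensitive
flags_df = pd.DataFrame(
    {spice: ingredients.str.contains(spice, case=False) for spice in spices}
)
print(flags_df)
```

These are plain substring matches, so a short name can match inside a longer word; for example, `sage` matches "sausage".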


#### Use this table to find every recipe that contains the ingredients you want

In [14]:
selection = spice_df.query("parsley & tarragon & paprika")

In [15]:
selection

Unnamed: 0,salt,pepper,oregano,sage,parsley,rosemary,tarragon,thyme,paprika,cumin
2069,False,True,False,False,True,False,True,False,True,False
74964,False,False,False,False,True,False,True,False,True,False
93768,True,True,False,True,True,False,True,False,True,False
113926,True,True,False,False,True,False,True,False,True,False
137686,True,True,False,False,True,False,True,False,True,False
140530,True,True,False,False,True,False,True,True,True,False
158475,True,True,False,False,True,False,True,False,True,True
158486,True,True,False,False,True,False,True,False,True,False
163175,True,True,True,False,True,False,True,False,True,False
165243,True,True,False,False,True,False,True,False,True,False


In [16]:
selection.sum()

salt         8
pepper       9
oregano      1
sage         1
parsley     10
rosemary     0
tarragon    10
thyme        1
paprika     10
cumin        1
dtype: int64

In [17]:
recipes.loc[selection.index, 'name']

2069      All cremat with a Little Gem, dandelion and wa...
74964                         Lobster with Thermidor butter
93768      Burton's Southern Fried Chicken with White Gravy
113926                     Mijo's Slow Cooker Shredded Beef
137686                     Asparagus Soup with Poached Eggs
140530                                 Fried Oyster Po’boys
158475                Lamb shank tagine with herb tabbouleh
158486                 Southern fried chicken in buttermilk
163175            Fried Chicken Sliders with Pickles + Slaw
165243                        Bar Tartine Cauliflower Salad
Name: name, dtype: object
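
The selection step can be sketched on toy data as follows. `query` evaluates the expression against the boolean columns, which is equivalent to combining the columns with `&` into a mask. All names and values here are made up for illustration.

```python
import pandas as pd

# Made-up recipe names and boolean spice flags (hypothetical sample data)
names = pd.Series(['Herb chicken', 'Plain rice', 'Spiced stew'])
flags_df = pd.DataFrame({
    'parsley':  [True, False, True],
    'tarragon': [True, False, False],
    'paprika':  [True, False, True],
})

# query evaluates the expression column-wise; & is an elementwise "and"
picked = flags_df.query('parsley & tarragon & paprika')

# Equivalent boolean-mask form
mask = flags_df['parsley'] & flags_df['tarragon'] & flags_df['paprika']

# Look up the matching names by index label
print(names.loc[picked.index].tolist())
```

Looking rows up by label with `.loc` stays correct even when the index is not a default `0..n-1` range; positional lookup with `.iloc` matches only in that default case.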