In [11]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [12]:
import pandas as pd
import numpy as np

<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">

*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

<!--NAVIGATION-->
< [Pivot Tables](03.09-Pivot-Tables.ipynb) | [Contents](Index.ipynb) | [Working with Time Series](03.11-Working-with-Time-Series.ipynb) >

<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.10-Working-With-Strings.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>


# Vectorized String Operations

One strength of Python is its relative ease in handling and manipulating string data.
Pandas builds on this and provides a comprehensive set of *vectorized string operations* that become an essential piece of the type of munging required when working with (read: cleaning up) real-world data.
In this section, we'll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet.

## Introducing Pandas String Operations

We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:

In [13]:
x = np.array([2, 3, 5, 7, 11, 13])
x * 2

array([ 4,  6, 10, 14, 22, 26])

This *vectorization* of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done.
For arrays of strings, NumPy does not provide such simple access, and thus you're stuck using a more verbose loop syntax:

In [14]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]

['Peter', 'Paul', 'Mary', 'Guido']

This is perhaps sufficient to work with some data, but it will break if there are any missing values.
For example:

In [15]:
# Uncommenting these lines raises
# AttributeError: 'NoneType' object has no attribute 'capitalize'
#data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
#[s.capitalize() for s in data]

Pandas includes features to address both this need for vectorized string operations and for correctly handling missing data via the ``str`` attribute of Pandas Series and Index objects containing strings.
So, for example, suppose we create a Pandas Series with this data:

In [16]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
names = pd.Series(data)
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

We can now call a single method that will capitalize all the entries, while skipping over any missing values:

In [17]:
names.str.capitalize()

0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object

Using tab completion on this ``str`` attribute will list all the vectorized string methods available to Pandas.

## Tables of Pandas String Methods

If you have a good understanding of string manipulation in Python, most of Pandas string syntax is intuitive enough that it's probably sufficient to just list a table of available methods; we will start with that here, before diving deeper into a few of the subtleties.
The examples in this section use the following series of names:

In [18]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

### Methods similar to Python string methods
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas ``str`` methods that mirror Python string methods:

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |

Notice that these have various return values. Some, like ``lower()``, return a series of strings:

In [19]:
monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

But some others return numbers:

In [20]:
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

Or Boolean values:

In [21]:
monte.str.startswith('T')

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

Still others return lists or other compound values for each element:

In [22]:
monte.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object

To get just the first names, we can split each name and take the first element of each resulting list:

In [23]:
temp = monte.str.split()
temp
temp[2][0]

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object

'Terry'

In [24]:
list(temp.apply(lambda x: x[0]))

['Graham', 'John', 'Terry', 'Eric', 'Terry', 'Michael']

In [25]:
temp.apply(lambda x: x[0]).values

array(['Graham', 'John', 'Terry', 'Eric', 'Terry', 'Michael'],
      dtype=object)

We'll see further manipulations of this kind of series-of-lists object as we continue our discussion.

### Methods using regular expressions

In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in ``re`` module:

| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |
| ``extract()`` | Call ``re.search()`` on each element, returning matched groups as strings.|
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of pattern with some other string|
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
| ``count()`` | Count occurrences of pattern|
| ``split()``   | Equivalent to ``str.split()``, but accepts regexps |
| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |

With these, you can do a wide range of interesting operations.
For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element:

In [26]:
monte.str.extract('([A-Za-z]+)', expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

Or we can do something more complicated, like finding all names that start and end with a consonant, making use of the start-of-string (``^``) and end-of-string (``$``) regular expression characters.
Here ``[^AEIOU]`` matches any single character that is NOT one of those letters, and ``.*`` matches any number of any characters:

In [27]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object

The ability to concisely apply regular expressions across ``Series`` or ``DataFrame`` entries opens up many possibilities for analysis and cleaning of data.
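
As a small illustration of how a few of the regular-expression methods in the table differ, the following sketch applies ``contains()``, ``match()``, and ``count()`` to the same names. The variable names (``monte_demo`` and the three results) are introduced here only for illustration:

```python
import pandas as pd

monte_demo = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                        'Eric Idle', 'Terry Jones', 'Michael Palin'])

# contains() looks for the pattern anywhere in each string
has_double_l = monte_demo.str.contains(r'll')

# match() only succeeds when the pattern matches at the start of the string
starts_with_terry = monte_demo.str.match(r'Terry')

# count() returns the number of non-overlapping matches in each string
# (only lowercase vowels are counted by this pattern)
n_lower_vowels = monte_demo.str.count(r'[aeiou]')
```

The first two results are Boolean series and the third is a series of integers, so they can be used directly as masks or in further arithmetic.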

### Miscellaneous methods
Finally, there are some miscellaneous methods that enable other convenient operations:

| Method | Description |
|--------|-------------|
| ``get()`` | Index each element |
| ``slice()`` | Slice each element|
| ``slice_replace()`` | Replace slice in each element with passed value|
| ``cat()``      | Concatenate strings|
| ``repeat()`` | Repeat values |
| ``normalize()`` | Return Unicode form of string |
| ``pad()`` | Add whitespace to left, right, or both sides of strings|
| ``wrap()`` | Split long strings into lines with length less than a given width|
| ``join()`` | Join strings in each element of the Series with passed separator|
| ``get_dummies()`` | Extract dummy variables as a ``DataFrame`` |

#### Vectorized item access and slicing

The ``get()`` and ``slice()`` operations, in particular, enable vectorized element access from each string.
For example, we can get a slice of the first three characters of each string using ``str.slice(0, 3)``.
Note that this behavior is also available through Python's normal indexing syntax; for example, ``df.str.slice(0, 3)`` is equivalent to ``df.str[0:3]``:

In [28]:
monte.str[0:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

In [29]:
monte.str.slice(0,3)

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

Indexing via ``df.str.get(i)`` and ``df.str[i]`` is likewise equivalent.

These ``get()`` and ``slice()`` methods also let you access elements of the lists returned by ``split()``.
For example, ``get(2)`` on its own returns the third character of each name, while combining ``split()`` and ``get(-1)`` extracts the last name of each entry:

In [30]:
monte.str.get(2)

0    a
1    h
2    r
3    i
4    r
5    c
dtype: object

In [31]:
monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
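
When both halves of the split are wanted at once, one option is the ``expand=True`` argument of ``str.split()``, which spreads the pieces across the columns of a ``DataFrame`` instead of returning lists. A minimal sketch, in which the column labels ``first`` and ``last`` and the variable names are illustrative choices:

```python
import pandas as pd

monte_demo = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                        'Eric Idle', 'Terry Jones', 'Michael Palin'])

# expand=True returns one column per split piece rather than a series of lists
name_parts = monte_demo.str.split(expand=True)

# Every name here has exactly two parts, so two labels are enough
name_parts.columns = ['first', 'last']
```

This relies on every entry splitting into the same number of pieces; names with middle names would produce extra columns filled with missing values for the shorter entries.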

#### Indicator variables

Another method that requires a bit of extra explanation is the ``get_dummies()`` method.
This is useful when your data has a column containing some sort of coded indicator.
For example, we might have a dataset that contains information in the form of codes, such as A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":

In [32]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte

|   | name | info |
|---|------|------|
| 0 | Graham Chapman | B\|C\|D |
| 1 | John Cleese | B\|D |
| 2 | Terry Gilliam | A\|C |
| 3 | Eric Idle | B\|D |
| 4 | Terry Jones | B\|C |
| 5 | Michael Palin | B\|C\|D |


The ``get_dummies()`` routine lets you quickly split out these indicator variables into a ``DataFrame``:

In [33]:
full_monte['info'].str.get_dummies('|')

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 | 1 |
| 2 | 1 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 1 |
| 4 | 0 | 1 | 1 | 0 |
| 5 | 0 | 1 | 1 | 1 |


With these operations as building blocks, you can construct an endless range of string processing procedures when cleaning your data.
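
As one example of using these building blocks together, the indicator columns from ``get_dummies()`` can be joined back onto the original frame and then used for filtering. This is only a sketch on a three-row subset of the data above; the names ``people``, ``indicators``, ``combined``, and ``likes_cheese`` are illustrative:

```python
import pandas as pd

people = pd.DataFrame({'name': ['Graham Chapman', 'John Cleese', 'Terry Gilliam'],
                       'info': ['B|C|D', 'B|D', 'A|C']})

# One indicator column per code found in the 'info' strings
indicators = people['info'].str.get_dummies('|')

# Attach the indicator columns to the original rows (aligned on the index)
combined = people.join(indicators)

# Select the names of everyone whose code list includes C ("likes cheese")
likes_cheese = combined.loc[combined['C'] == 1, 'name']
```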

We won't dive further into these methods here, but I encourage you to read through ["Working with Text Data"](http://pandas.pydata.org/pandas-docs/stable/text.html) in the Pandas online documentation, or to refer to the resources listed in [Further Resources](03.13-Further-Resources.ipynb).

## Example: Recipe Database

These vectorized string operations become most useful in the process of cleaning up messy, real-world data.
Here I'll walk through an example of that, using an open recipe database compiled from various sources on the Web.
Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.

The scripts used to compile this can be found at https://github.com/fictivekin/openrecipes, and the link to the current version of the database is found there as well.

As of Spring 2016, this database is about 30 MB, and can be downloaded and unzipped with these commands:

In [None]:
# Download the compressed database (this command worked as-is)
# !curl -O https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz

In [None]:
# This cell does not work on systems where gunzip,
# a GNU program, is not available

#!gunzip 20170107-061401-recipeitems.json.gz
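
Where ``gunzip`` is not available, the archive can instead be decompressed with Python's standard-library ``gzip`` module. The helper below is a minimal sketch; the function name ``gunzip_file`` is hypothetical, and the example call assumes the archive has already been downloaded into the working directory:

```python
import gzip
import shutil

def gunzip_file(src_path, dest_path):
    """Decompress the gzip file at src_path and write the result to dest_path."""
    with gzip.open(src_path, 'rb') as f_in, open(dest_path, 'wb') as f_out:
        # Stream the data in chunks so the whole file is never held in memory
        shutil.copyfileobj(f_in, f_out)

# Example call (assumes the .gz file is in the current directory):
# gunzip_file('20170107-061401-recipeitems.json.gz',
#             '20170107-061401-recipeitems.json')
```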

### New cells: reading one record into a DataFrame

In [115]:
with open('20170107-061401-recipeitems.json') as f:
    line = f.readline()
type(line)
line

str

'{ "_id" : { "$oid" : "5160756b96cc62079cc2db15" }, "name" : "Drop Biscuits and Sausage Gravy", "ingredients" : "Biscuits\\n3 cups All-purpose Flour\\n2 Tablespoons Baking Powder\\n1/2 teaspoon Salt\\n1-1/2 stick (3/4 Cup) Cold Butter, Cut Into Pieces\\n1-1/4 cup Butermilk\\n SAUSAGE GRAVY\\n1 pound Breakfast Sausage, Hot Or Mild\\n1/3 cup All-purpose Flour\\n4 cups Whole Milk\\n1/2 teaspoon Seasoned Salt\\n2 teaspoons Black Pepper, More To Taste", "url" : "http://thepioneerwoman.com/cooking/2013/03/drop-biscuits-and-sausage-gravy/", "image" : "http://static.thepioneerwoman.com/cooking/files/2013/03/bisgrav.jpg", "ts" : { "$date" : 1365276011104 }, "cookTime" : "PT30M", "source" : "thepioneerwoman", "recipeYield" : "12", "datePublished" : "2013-03-11", "prepTime" : "PT10M", "description" : "Late Saturday afternoon, after Marlboro Man had returned home with the soccer-playing girls, and I had returned home with the..." }\n'

In [116]:
import json
j1 = json.loads(line)

In [117]:
type(j1)
j1.keys()
j1

dict

dict_keys(['_id', 'name', 'ingredients', 'url', 'image', 'ts', 'cookTime', 'source', 'recipeYield', 'datePublished', 'prepTime', 'description'])

{'_id': {'$oid': '5160756b96cc62079cc2db15'},
 'name': 'Drop Biscuits and Sausage Gravy',
 'ingredients': 'Biscuits\n3 cups All-purpose Flour\n2 Tablespoons Baking Powder\n1/2 teaspoon Salt\n1-1/2 stick (3/4 Cup) Cold Butter, Cut Into Pieces\n1-1/4 cup Butermilk\n SAUSAGE GRAVY\n1 pound Breakfast Sausage, Hot Or Mild\n1/3 cup All-purpose Flour\n4 cups Whole Milk\n1/2 teaspoon Seasoned Salt\n2 teaspoons Black Pepper, More To Taste',
 'url': 'http://thepioneerwoman.com/cooking/2013/03/drop-biscuits-and-sausage-gravy/',
 'image': 'http://static.thepioneerwoman.com/cooking/files/2013/03/bisgrav.jpg',
 'ts': {'$date': 1365276011104},
 'cookTime': 'PT30M',
 'source': 'thepioneerwoman',
 'recipeYield': '12',
 'datePublished': '2013-03-11',
 'prepTime': 'PT10M',
 'description': 'Late Saturday afternoon, after Marlboro Man had returned home with the soccer-playing girls, and I had returned home with the...'}

In [118]:
j1['_id'] = j1['_id']['$oid']
j1

{'_id': '5160756b96cc62079cc2db15',
 'name': 'Drop Biscuits and Sausage Gravy',
 'ingredients': 'Biscuits\n3 cups All-purpose Flour\n2 Tablespoons Baking Powder\n1/2 teaspoon Salt\n1-1/2 stick (3/4 Cup) Cold Butter, Cut Into Pieces\n1-1/4 cup Butermilk\n SAUSAGE GRAVY\n1 pound Breakfast Sausage, Hot Or Mild\n1/3 cup All-purpose Flour\n4 cups Whole Milk\n1/2 teaspoon Seasoned Salt\n2 teaspoons Black Pepper, More To Taste',
 'url': 'http://thepioneerwoman.com/cooking/2013/03/drop-biscuits-and-sausage-gravy/',
 'image': 'http://static.thepioneerwoman.com/cooking/files/2013/03/bisgrav.jpg',
 'ts': {'$date': 1365276011104},
 'cookTime': 'PT30M',
 'source': 'thepioneerwoman',
 'recipeYield': '12',
 'datePublished': '2013-03-11',
 'prepTime': 'PT10M',
 'description': 'Late Saturday afternoon, after Marlboro Man had returned home with the soccer-playing girls, and I had returned home with the...'}

In [119]:
j1['ts'] = j1['ts']['$date']
j1

{'_id': '5160756b96cc62079cc2db15',
 'name': 'Drop Biscuits and Sausage Gravy',
 'ingredients': 'Biscuits\n3 cups All-purpose Flour\n2 Tablespoons Baking Powder\n1/2 teaspoon Salt\n1-1/2 stick (3/4 Cup) Cold Butter, Cut Into Pieces\n1-1/4 cup Butermilk\n SAUSAGE GRAVY\n1 pound Breakfast Sausage, Hot Or Mild\n1/3 cup All-purpose Flour\n4 cups Whole Milk\n1/2 teaspoon Seasoned Salt\n2 teaspoons Black Pepper, More To Taste',
 'url': 'http://thepioneerwoman.com/cooking/2013/03/drop-biscuits-and-sausage-gravy/',
 'image': 'http://static.thepioneerwoman.com/cooking/files/2013/03/bisgrav.jpg',
 'ts': 1365276011104,
 'cookTime': 'PT30M',
 'source': 'thepioneerwoman',
 'recipeYield': '12',
 'datePublished': '2013-03-11',
 'prepTime': 'PT10M',
 'description': 'Late Saturday afternoon, after Marlboro Man had returned home with the soccer-playing girls, and I had returned home with the...'}

In [124]:
temp_df = pd.DataFrame(j1, index=[0], columns=j1.keys())
temp_df

Unnamed: 0,_id,name,ingredients,url,image,ts,cookTime,source,recipeYield,datePublished,prepTime,description
0,5160756b96cc62079cc2db15,Drop Biscuits and Sausage Gravy,Biscuits\n3 cups All-purpose Flour\n2 Tablespo...,http://thepioneerwoman.com/cooking/2013/03/dro...,http://static.thepioneerwoman.com/cooking/file...,1365276011104,PT30M,thepioneerwoman,12,2013-03-11,PT10M,"Late Saturday afternoon, after Marlboro Man ha..."


### Excluding some keys from the DataFrame

In [161]:
my_cols = list(j1.keys())
my_cols.remove('_id')
my_cols.remove('ts')
my_cols

['name',
 'ingredients',
 'url',
 'image',
 'cookTime',
 'source',
 'recipeYield',
 'datePublished',
 'prepTime',
 'description']

In [162]:
temp_df2 = pd.DataFrame(j1, index=[0], columns=my_cols)
temp_df2

Unnamed: 0,name,ingredients,url,image,cookTime,source,recipeYield,datePublished,prepTime,description
0,Beef Stew,2 pounds 2 pounds\n1 pound 1 pound\n5 whole 5 ...,http://tastykitchen.com/recipes/soups/beef-ste...,http://static.tastykitchen.com/wp-content/them...,PT3H,tastykitchen,12,2009-09-17,PT15M,I love this stew in the fall and winter. It i...


### Original cells

Each line of the file is a valid JSON document on its own, but the file as a whole is not, so we'll need to string the lines together.
One way we can do this is to actually construct a string representation containing all these JSON entries, and then load the whole thing with ``pd.read_json``:

In [None]:
# original, with the JSON string wrapped in StringIO for newer pandas versions
from io import StringIO

# read the entire file and join its lines into a single JSON array string
with open('20170107-061401-recipeitems.json', 'r') as f:
    # Extract each line
    data = (line.strip() for line in f)
    # Reformat so each line is the element of a list
    data_json = "[{0}]".format(','.join(data))
# read the result as a JSON
recipes = pd.read_json(StringIO(data_json))
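
Pandas can also read newline-delimited JSON directly through the ``lines=True`` option of ``pd.read_json``, which avoids building the combined string by hand. The sketch below uses two made-up records in place of the real file; the sample data and the name ``recipes_sample`` are illustrative only:

```python
import io
import pandas as pd

# Two newline-delimited JSON records standing in for lines of the recipe file
sample = io.StringIO(
    '{"name": "Beef Stew", "source": "tastykitchen"}\n'
    '{"name": "Slab Apple Pie", "source": "tastykitchen"}\n'
)

# lines=True tells pandas to parse one JSON object per line
recipes_sample = pd.read_json(sample, lines=True)
```

With the real file, the same option would presumably be passed along with the file path; nested fields such as ``_id`` and ``ts`` would likely still need to be flattened afterwards.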

### New cells: processing the entire file

In [135]:
# interrupted this after 5-10 min, by which point about 22,000 recipes had been read

recipes_df = pd.DataFrame()

with open('20170107-061401-recipeitems.json', 'r', encoding='utf-8') as f:
    # Extract each line
    icount = 0
    for line in f:
        j1 = json.loads(line)
        j1['_id'] = j1['_id']['$oid']
        j1['ts'] = j1['ts']['$date']
        
        temp_df = pd.DataFrame(j1, index=[icount], columns=j1.keys())
        recipes_df = temp_df.append(recipes_df)
        
        icount +=1
        
        #if icount > 5:
            #break


KeyboardInterrupt: 
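
The loop above builds a one-row ``DataFrame`` for every line and combines it with everything accumulated so far, which copies all earlier rows on each pass; that is a likely reason the cell ran for minutes. ``DataFrame.append`` was also removed in pandas 2.0. A common alternative is to collect plain dictionaries in a list and construct the ``DataFrame`` once at the end. The sketch below shows the idea on two made-up records; the function name ``load_json_lines`` and the sample data are hypothetical:

```python
import io
import json
import pandas as pd

def load_json_lines(lines):
    """Parse an iterable of JSON lines into a DataFrame, flattening _id and ts."""
    records = []
    for line in lines:
        record = json.loads(line)
        # Replace the nested {"$oid": ...} and {"$date": ...} dicts by their values
        record['_id'] = record['_id']['$oid']
        record['ts'] = record['ts']['$date']
        records.append(record)
    # Build the frame once; keys missing from a record become NaN
    return pd.DataFrame(records)

# Made-up records for illustration; the second has an extra key
sample_lines = io.StringIO(
    '{"_id": {"$oid": "a1"}, "name": "Beef Stew", "ts": {"$date": 1}}\n'
    '{"_id": {"$oid": "b2"}, "name": "Cioppino", "ts": {"$date": 2}, "cookTime": "PT1H"}\n'
)
demo_df = load_json_lines(sample_lines)
```

Unlike the loop above, this keeps the rows in file order with an index running upward from 0.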

In [141]:
recipes_df.shape
recipes_df.columns
recipes_df.info()
recipes_df.head()
recipes_df.tail()

(22457, 17)

Index(['_id', 'name', 'ingredients', 'url', 'image', 'ts', 'cookTime',
       'source', 'recipeYield', 'datePublished', 'prepTime', 'description',
       'totalTime', 'recipeInstructions', 'recipeCategory', 'dateModified',
       'creator'],
      dtype='object')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22457 entries, 22456 to 0
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   _id                 22457 non-null  object
 1   name                22457 non-null  object
 2   ingredients         22457 non-null  object
 3   url                 22457 non-null  object
 4   image               16544 non-null  object
 5   ts                  22457 non-null  int64 
 6   cookTime            13673 non-null  object
 7   source              22457 non-null  object
 8   recipeYield         21914 non-null  object
 9   datePublished       14847 non-null  object
 10  prepTime            15411 non-null  object
 11  description         15495 non-null  object
 12  totalTime           339 non-null    object
 13  recipeInstructions  4 non-null      object
 14  recipeCategory      301 non-null    object
 15  dateModified        141 non-null    object
 16  creator             32

Unnamed: 0,_id,name,ingredients,url,image,ts,cookTime,source,recipeYield,datePublished,prepTime,description,totalTime,recipeInstructions,recipeCategory,dateModified,creator
22456,51611fd596cc620d26155415,Beef Stew,2 pounds 2 pounds\n2 Tablespoons 2 Tablespoons...,http://tastykitchen.com/recipes/soups/beef-ste...,http://static.tastykitchen.com/recipes/files/2...,1365319637946,PT2H15M,tastykitchen,8,2009-10-30,PT30M,I’ve been making stew using this same recipe f...,,,,,
22455,51611fd496cc620d26155414,Beef Stew,3 Tablespoons 3 Tablespoons\n½ teaspoons ½ tea...,http://tastykitchen.com/recipes/soups/beef-ste...,http://static.tastykitchen.com/recipes/files/2...,1365319636310,PT2H,tastykitchen,6,2010-11-02,PT30M,"When the weather starts getting cooler, this i...",,,,,
22454,51611fd396cc620d26155413,"Beef Stew with Kale, Sweet Potatoes and Quinoa",1 Tablespoon 1 Tablespoon\n1 pound 1 pound\n1 ...,http://tastykitchen.com/recipes/soups/beef-ste...,http://static.tastykitchen.com/recipes/files/2...,1365319635180,PT1H,tastykitchen,4,2011-03-26,PT10M,"If you like hot stew on a cold day, and you li...",,,,,
22453,51611fd096cc620d26155412,Cinnamon-Maple Pumpkin Coffee Cake,1-½ cup 1-½ cup\n1-½ cup 1-½ cup\n2-½ teaspoon...,http://tastykitchen.com/recipes/breakfastbrunc...,http://static.tastykitchen.com/recipes/files/2...,1365319632699,PT35M,tastykitchen,9,2012-10-18,PT,A tender pumpkin coffee cake layered with a ci...,,,,,
22452,51611fcf96cc620d26155411,Slab Apple Pie,1 whole 1 whole\n8 whole 8 whole\n1 teaspoon 1...,http://tastykitchen.com/recipes/desserts/slab-...,http://static.tastykitchen.com/recipes/files/2...,1365319631471,PT30M,tastykitchen,12,2012-10-18,PT1H,"The perfect fall comfort food, this slab-style...",,,,,


Unnamed: 0,_id,name,ingredients,url,image,ts,cookTime,source,recipeYield,datePublished,prepTime,description,totalTime,recipeInstructions,recipeCategory,dateModified,creator
4,5160757496cc6207a37ff778,Pomegranate Yogurt Bowl,For each bowl: \na big dollop of Greek yogurt\...,http://www.101cookbooks.com/archives/pomegrana...,http://www.101cookbooks.com/mt-static/images/f...,1365276020318,,101cookbooks,Serves 1.,2013-01-20,PT5M,A simple breakfast bowl made with Greek yogurt...,,,,,
3,5160757096cc62079cc2db17,Mixed Berry Shortcake,Biscuits\n3 cups All-purpose Flour\n2 Tablespo...,http://thepioneerwoman.com/cooking/2013/03/mix...,http://static.thepioneerwoman.com/cooking/file...,1365276016700,PT15M,thepioneerwoman,8,2013-03-18,PT15M,It's Monday! It's a brand new week! The birds ...,,,,,
2,5160756f96cc6207a37ff777,Morrocan Carrot and Chickpea Salad,Dressing:\n1 tablespoon cumin seeds\n1/3 cup /...,http://www.101cookbooks.com/archives/moroccan-...,http://www.101cookbooks.com/mt-static/images/f...,1365276015332,,101cookbooks,,2013-01-07,PT15M,A beauty of a carrot salad - tricked out with ...,,,,,
1,5160756d96cc62079cc2db16,Hot Roast Beef Sandwiches,12 whole Dinner Rolls Or Small Sandwich Buns (...,http://thepioneerwoman.com/cooking/2013/03/hot...,http://static.thepioneerwoman.com/cooking/file...,1365276013902,PT20M,thepioneerwoman,12,2013-03-13,PT20M,"When I was growing up, I participated in my Ep...",,,,,
0,5160756b96cc62079cc2db15,Drop Biscuits and Sausage Gravy,Biscuits\n3 cups All-purpose Flour\n2 Tablespo...,http://thepioneerwoman.com/cooking/2013/03/dro...,http://static.thepioneerwoman.com/cooking/file...,1365276011104,PT30M,thepioneerwoman,12,2013-03-11,PT10M,"Late Saturday afternoon, after Marlboro Man ha...",,,,,


### Back to the original investigation

In [139]:
recipes_df.shape

(22457, 17)

We see there are about 22,000 recipes (the portion read before the loop was interrupted) and 17 columns.
Let's take a look at one row to see what we have:

In [140]:
recipes_df.iloc[0]

_id                                            51611fd596cc620d26155415
name                                                          Beef Stew
ingredients           2 pounds 2 pounds\n2 Tablespoons 2 Tablespoons...
url                   http://tastykitchen.com/recipes/soups/beef-ste...
image                 http://static.tastykitchen.com/recipes/files/2...
ts                                                        1365319637946
cookTime                                                        PT2H15M
source                                                     tastykitchen
recipeYield                                                           8
datePublished                                                2009-10-30
prepTime                                                          PT30M
description           I’ve been making stew using this same recipe f...
totalTime                                                           NaN
recipeInstructions                                              

There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web.
In particular, the ingredient list is in string format; we're going to have to carefully extract the information we're interested in.
Let's start by taking a closer look at the ingredients:

In [142]:
recipes_df.ingredients.str.len().describe()

count    22457.000000
mean       271.696976
std        175.934867
min          0.000000
25%        149.000000
50%        229.000000
75%        349.000000
max       3247.000000
Name: ingredients, dtype: float64

The ingredient lists average about 270 characters long, with a minimum of 0 and a maximum of more than 3,000 characters!

Just out of curiosity, let's see which recipe has the longest ingredient list:

In [143]:
recipes_df.name[np.argmax(recipes_df.ingredients.str.len())]

'Korean BBQ Marinade'

That certainly looks like an involved recipe.
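
One caveat applies to this kind of lookup: the index labels of ``recipes_df`` run from 22456 down to 0, so an integer position and an index label with the same value refer to different rows. In recent pandas versions an ``argmax`` result is a position, while ``name[...]`` looks up a label, so the two may not agree. The toy sketch below, with made-up data and illustrative variable names, shows two ways to keep them consistent:

```python
import pandas as pd

# Toy frame whose index labels run in reverse, like recipes_df
toy = pd.DataFrame({'name': ['longest', 'short', 'medium'],
                    'ingredients': ['a\nb\nc\nd', 'a', 'a\nb']},
                   index=[2, 1, 0])
lengths = toy.ingredients.str.len()

# idxmax() returns the index LABEL of the maximum, suitable for label lookup
by_label = toy.name[lengths.idxmax()]

# argmax on the underlying array returns a POSITION, suitable for .iloc
position = lengths.to_numpy().argmax()
by_position = toy.name.iloc[position]

# Mixing the two (a position used as a label) selects a different row here
mixed_up = toy.name[position]
```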

We can do other aggregate explorations; for example, let's see how many of the recipes are for breakfast food:

In [144]:
recipes_df.description.str.contains('[Bb]reakfast').sum()

428

Or how many of the recipes list cinnamon as an ingredient:

In [145]:
recipes_df.ingredients.str.contains('[Cc]innamon').sum()

1126

We could even look to see whether any recipes misspell the ingredient as "cinamon":

In [146]:
recipes_df.ingredients.str.contains('[Cc]inamon').sum()

0

This is the type of essential data exploration that is possible with Pandas string tools.
It is data munging like this that Python really excels at.

### A simple recipe recommender

Let's go a bit further, and start working on a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all those ingredients.
While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row.
So we will cheat a bit: we'll start with a list of common ingredients, and simply search to see whether they are in each recipe's ingredient list.
For simplicity, let's just stick with herbs and spices for the time being:

In [147]:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

We can then build a Boolean ``DataFrame`` consisting of True and False values, indicating whether each spice appears in each recipe's ingredient list:

In [149]:
import re
spice_df = pd.DataFrame(dict((spice, recipes_df.ingredients.str.contains(spice, re.IGNORECASE))
                             for spice in spice_list))
spice_df.head()

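|   | salt | pepper | oregano | sage | parsley | rosemary | tarragon | thyme | paprika | cumin |
|---|------|--------|---------|------|---------|----------|----------|-------|---------|-------|
| 22456 | False | False | False | False | False | False | False | False | False | False |
| 22455 | False | False | False | False | False | False | False | False | False | False |
| 22454 | False | False | False | False | False | False | False | False | False | False |
| 22453 | False | False | False | False | False | False | False | False | False | False |
| 22452 | False | False | False | False | False | False | False | False | False | False |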
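
A note on case-insensitive matching: the signature of ``str.contains`` is ``contains(pat, case=True, flags=0, ...)``, so a regular-expression flag passed as the second positional argument lands in ``case`` rather than ``flags``. Passing it by keyword, or using ``case=False``, makes the intent explicit. A minimal sketch on made-up ingredient strings (all names here are illustrative):

```python
import re
import pandas as pd

ingredients_demo = pd.Series(['1 teaspoon Salt', '2 cups flour', 'pinch of salt'])

# Pass the regex flag by keyword so that it reaches the `flags` parameter
via_flags = ingredients_demo.str.contains('salt', flags=re.IGNORECASE)

# Equivalent result using the dedicated `case` parameter
via_case = ingredients_demo.str.contains('salt', case=False)
```

Both calls match ``Salt`` as well as ``salt``; counts obtained with a case-sensitive search may be lower.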


Now, as an example, let's say we'd like to find a recipe that uses parsley, paprika, and tarragon.
We can compute this very quickly using the ``query()`` method of ``DataFrame``s, discussed in [High-Performance Pandas: ``eval()`` and ``query()``](03.12-Performance-Eval-and-Query.ipynb):

In [150]:
selection = spice_df.query('parsley & paprika & tarragon')
len(selection)

1

We find only one recipe with this combination; let's use the index returned by this selection to discover its name:

In [152]:
recipes_df.name[selection.index]

2069    All cremat with a Little Gem, dandelion and wa...
Name: name, dtype: object

Now that we have narrowed down our recipe selection by a factor of more than 20,000, we are in a position to make a more informed decision about what we'd like to cook for dinner.
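
The same lookup can be wrapped in a small helper so that the list of wanted spices is passed as data rather than written into a query string. This is a sketch rather than part of the original analysis: the function name ``find_recipes`` and the toy data are hypothetical, and it uses a Boolean mask with ``all(axis=1)`` in place of ``query()``:

```python
import pandas as pd

def find_recipes(names, spice_flags, wanted):
    """Return the names of rows flagged True for every spice in `wanted`."""
    # Keep only the wanted columns, then require True across each row
    mask = spice_flags[list(wanted)].all(axis=1)
    return names[mask]

# Made-up names and flags sharing the same default index
toy_names = pd.Series(['Herb Roast', 'Plain Rice', 'Spiced Stew'])
toy_flags = pd.DataFrame({'parsley': [True, False, True],
                          'paprika': [True, False, True],
                          'tarragon': [True, False, False]})

two_spices = find_recipes(toy_names, toy_flags, ['parsley', 'paprika'])
three_spices = find_recipes(toy_names, toy_flags, ['parsley', 'paprika', 'tarragon'])
```

The mask and the names must share the same index for the selection to line up, as ``spice_df`` and ``recipes_df`` do above.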

### Trying another spice list

In [154]:
spice_list_ital = ['oregano' , 'basil']
spice_df = pd.DataFrame(dict((spice, recipes_df.ingredients.str.contains(spice, re.IGNORECASE))
                             for spice in spice_list_ital))
spice_df.head()

|   | oregano | basil |
|---|---------|-------|
| 22456 | False | False |
| 22455 | False | False |
| 22454 | False | False |
| 22453 | False | False |
| 22452 | False | False |


In [155]:
selection = spice_df.query('oregano & basil')
len(selection)
recipes_df.name[selection.index]

65

15741    Lasagna with Turkey Sausage Bolognese
15551                            Sausage Rolls
14881                Sun Dried Tomato Marinade
14843                Eggs with Tomato on Toast
14102    Herb-and-Spice Southern Fried Chicken
                         ...                  
869        Eggplant-Pepper Tomato Sauce Recipe
833        Eggplant-Pepper Tomato Sauce Recipe
467                     Nigel’s harvest supper
424                                   Cioppino
33              Cauliflower Pizza Crust Recipe
Name: name, Length: 65, dtype: object

### Going further with recipes

Hopefully this example has given you a bit of a flavor (ba-dum!) for the types of data cleaning operations that are efficiently enabled by Pandas string methods.
Of course, building a very robust recipe recommendation system would require a *lot* more work!
Extracting full ingredient lists from each recipe would be an important piece of the task; unfortunately, the wide variety of formats used makes this a relatively time-consuming process.
This points to the truism that in data science, cleaning and munging of real-world data often comprises the majority of the work, and Pandas provides the tools that can help you do this efficiently.

