Pandas builds on this and provides a comprehensive set of vectorized string operations, which are essential for the kind of munging required when working with (read: cleaning up) real-world data. In this section, we'll walk through some of the Pandas string operations and then use them to partially clean up a very messy dataset of recipes collected from the Internet.

In [2]:
import numpy as np
import pandas as pd


In [3]:
x = np.array([2, 3, 5, 7, 11, 13])
x * 2

array([ 4,  6, 10, 14, 22, 26])

In [4]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]

This is perhaps sufficient to work with some data, but it will break if there are any missing values. For example:

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]  # raises AttributeError: 'NoneType' object has no attribute 'capitalize'

Pandas addresses both needs, vectorized string operations and correct handling of missing data, through the str attribute of Pandas Series and Index objects containing strings. For example, suppose we create a Pandas Series with this data:

In [5]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
names = pd.Series(data)
names

0    peter
1     Paul
2     MARY
3    gUIDO
dtype: object

In [6]:
names.str.capitalize()

0    Peter
1     Paul
2     Mary
3    Guido
dtype: object
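
The Series above happens to contain no missing entries, so it does not exercise the missing-data handling described earlier. A minimal sketch of that behavior follows; the variable name `names_with_gap` is introduced here purely for illustration.

```python
import pandas as pd

# Hypothetical data: the same names with one missing entry added
names_with_gap = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# The .str accessor skips the missing entry instead of raising an error
capitalized = names_with_gap.str.capitalize()
print(capitalized)

# The missing value stays missing in the result
print(capitalized.isna().tolist())
```

The plain list comprehension fails on the same input, because `None` has no `capitalize()` method.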

## Methods similar to Python string methods
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas str methods that mirror Python string methods:

    len()	lower()	translate()	islower()

    ljust()	upper()	startswith()	isupper()
    
    rjust()	find()	endswith()	isnumeric()

    center()	rfind()	isalnum()	isdecimal()

    zfill()	index()	isalpha()	split()

    strip()	rindex()	isdigit()	rsplit()

    rstrip()	capitalize()	isspace()	partition()

    lstrip()	swapcase()	istitle()	rpartition()

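Notice that these have various return values. Some, like lower(), return a series of strings; others return numbers (len()), Boolean values (startswith()), or lists (split()), as the following cells show: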

In [7]:
DS = pd.Series(['shubham  chavan', 'harshad  kalsait', 'bhushan  bhagul',
                'magadh singh', 'pavan bhandare', 'rohit kulhat', 'gauri auti'])

In [8]:
DS.str.lower()

0     shubham  chavan
1    harshad  kalsait
2     bhushan  bhagul
3        magadh singh
4      pavan bhandare
5        rohit kulhat
6          gauri auti
dtype: object

In [9]:
DS.str.len()

0    15
1    16
2    15
3    12
4    14
5    12
6    10
dtype: int64

In [10]:
DS.str.startswith('h')

0    False
1     True
2    False
3    False
4    False
5    False
6    False
dtype: bool

In [11]:
DS.str.split()

0     [shubham, chavan]
1    [harshad, kalsait]
2     [bhushan, bhagul]
3       [magadh, singh]
4     [pavan, bhandare]
5       [rohit, kulhat]
6         [gauri, auti]
dtype: object

In [12]:
DS.str.strip()

0     shubham  chavan
1    harshad  kalsait
2     bhushan  bhagul
3        magadh singh
4      pavan bhandare
5        rohit kulhat
6          gauri auti
dtype: object
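
The strip() output above looks unchanged because none of the names has leading or trailing whitespace, and strip() never touches whitespace inside a string. The sketch below, using made-up values in a variable called `raw`, shows the difference and one possible way to collapse internal runs of spaces with a regular-expression replace.

```python
import pandas as pd

# Hypothetical values with stray whitespace at the edges and in the middle
raw = pd.Series(['  shubham  chavan ', 'magadh singh', 'gauri   auti  '])

# strip() removes only leading and trailing whitespace
stripped = raw.str.strip()
print(stripped.tolist())

# A regex replace collapses each internal run of whitespace to one space
normalized = stripped.str.replace(r'\s+', ' ', regex=True)
print(normalized.tolist())
```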

### Methods using regular expressions
In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in re module:

Method	Description

    match()	Call re.match() on each element, returning a boolean.
        
    findall()	Call re.findall() on each element
    
    replace()	Replace occurrences of pattern with some other string
    
    contains()	Call re.search() on each element, returning a boolean
    
    count()	Count occurrences of pattern
    
    split()	Equivalent to str.split(), but accepts regexps
    
    rsplit()	Equivalent to str.rsplit(), but accepts regexps
    
With these, you can do a wide range of interesting operations. For example, we can extract the first name from each element by asking for a contiguous group of letters at the start of the string:

In [13]:
DS.str.extract('([A-Za-z]+)', expand=False)

0    shubham
1    harshad
2    bhushan
3     magadh
4      pavan
5      rohit
6      gauri
dtype: object

In [14]:
DS.str.extract('([A-Za-z].+)', expand=True)

                  0
0   shubham  chavan
1  harshad  kalsait
2   bhushan  bhagul
3      magadh singh
4    pavan bhandare
5      rohit kulhat
6        gauri auti


In [15]:
DS.str.extract('([A-Za-z]+.[0-9]+.[A-Za-z]+)', expand=False)

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
dtype: object

In [16]:
DS.str.extract('([A-Z a-z]+)', expand=False)

0     shubham  chavan
1    harshad  kalsait
2     bhushan  bhagul
3        magadh singh
4      pavan bhandare
5        rohit kulhat
6          gauri auti
dtype: object
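
The extract() calls above use a single capture group. As a sketch of how several groups behave, the example below uses named groups to pull first and last names into separate columns; the group names `first` and `last` and the variable `people` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# A few of the names used above, including one with a double space
people = pd.Series(['shubham  chavan', 'magadh singh', 'gauri auti'])

# Each named group becomes one column of the resulting DataFrame
parts = people.str.extract(r'(?P<first>[A-Za-z]+)\s+(?P<last>[A-Za-z]+)')
print(parts)
```

Because the pattern allows one or more whitespace characters between the groups, single-spaced and double-spaced names are handled the same way.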

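Or we can do something more complicated, like finding all names that start and end with a consonant, making use of the start-of-string (^) and end-of-string ($) regular expression characters. The first pattern below excludes only uppercase vowels at the start, which is harmless here because every name is lowercase; the second excludes vowels of both cases at both ends: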

In [17]:
DS.str.findall(r'^[^AEIOU].*[^aeiou]$')

0     [shubham  chavan]
1    [harshad  kalsait]
2     [bhushan  bhagul]
3        [magadh singh]
4                    []
5        [rohit kulhat]
6                    []
dtype: object

In [18]:
DS.str.findall(r'^[^AEIOUaeiou].*[^AEIOUaeiou]$')

0     [shubham  chavan]
1    [harshad  kalsait]
2     [bhushan  bhagul]
3        [magadh singh]
4                    []
5        [rohit kulhat]
6                    []
dtype: object
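
findall() returns a list per element, with an empty list where nothing matched. When the goal is to keep only the matching rows, a Boolean mask is often more convenient. The following sketch applies the same pattern through contains(); the names `mask` and `consonant_names` are introduced for this example only.

```python
import pandas as pd

DS = pd.Series(['shubham  chavan', 'harshad  kalsait', 'bhushan  bhagul',
                'magadh singh', 'pavan bhandare', 'rohit kulhat', 'gauri auti'])

# True where the string neither starts nor ends with a vowel
mask = DS.str.contains(r'^[^AEIOUaeiou].*[^AEIOUaeiou]$')

# Boolean indexing keeps only the matching rows
consonant_names = DS[mask]
print(consonant_names)
```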

### Miscellaneous methods
Finally, there are some miscellaneous methods that enable other convenient operations:

Method	Description

    get()	Index each element
    
    slice()	Slice each element

    slice_replace()	Replace slice in each element with passed value

    cat()	Concatenate strings

    repeat()	Repeat values

    normalize()	Return Unicode form of string

    pad()	Add whitespace to left, right, or both sides of strings

    wrap()	Split long strings into lines with length less than a given width

    join()	Join strings in each element of the Series with passed separator

    get_dummies()	Extract dummy variables as a DataFrame


###  Vectorized item access and slicing
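
As a minimal sketch of item access and slicing through the str accessor, the example below slices the first three characters of each string and picks the last word after splitting. The Series `people` holds a few single-spaced names chosen for illustration.

```python
import pandas as pd

people = pd.Series(['shubham chavan', 'magadh singh', 'gauri auti'])

# Slice each element; equivalent to people.str.slice(0, 3)
first_three = people.str[0:3]
print(first_three.tolist())

# split() yields a list per element, and get(-1) indexes into each list
last_names = people.str.split().str.get(-1)
print(last_names.tolist())
```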

In [20]:
op = DS.str.swapcase()
op

0     SHUBHAM  CHAVAN
1    HARSHAD  KALSAIT
2     BHUSHAN  BHAGUL
3        MAGADH SINGH
4      PAVAN BHANDARE
5        ROHIT KULHAT
6          GAURI AUTI
dtype: object

In [25]:
full_DS = pd.DataFrame({'name': op,
                        'info': ['B|C|D', 'B|D', 'A|C',
                                 'B|D', 'B|C', 'B|C|D', 'A|B|C']})
full_DS

               name   info
0   SHUBHAM  CHAVAN  B|C|D
1  HARSHAD  KALSAIT    B|D
2   BHUSHAN  BHAGUL    A|C
3      MAGADH SINGH    B|D
4    PAVAN BHANDARE    B|C
5      ROHIT KULHAT  B|C|D
6        GAURI AUTI  A|B|C

In [26]:
full_DS['info'].str.get_dummies('|')

   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1
6  1  1  1  0


# Case study: Recipe Database

In [27]:
Recipe = pd.read_json('openrecipes.txt', lines=True)
Recipe

Unnamed: 0,name,ingredients,url,image,cookTime,recipeYield,datePublished,prepTime,description
0,Easter Leftover Sandwich,12 whole Hard Boiled Eggs\n1/2 cup Mayonnaise\...,http://thepioneerwoman.com/cooking/2013/04/eas...,http://static.thepioneerwoman.com/cooking/file...,PT,8,2013-04-01,PT15M,Got leftover Easter eggs? Got leftover East...
1,Pasta with Pesto Cream Sauce,3/4 cups Fresh Basil Leaves\n1/2 cup Grated Pa...,http://thepioneerwoman.com/cooking/2011/06/pas...,http://static.thepioneerwoman.com/cooking/file...,PT10M,8,2011-06-06,PT6M,I finally have basil in my garden. Basil I can...
2,Herb Roasted Pork Tenderloin with Preserves,"2 whole Pork Tenderloins\n Salt And Pepper, to...",http://thepioneerwoman.com/cooking/2011/09/her...,http://static.thepioneerwoman.com/cooking/file...,PT15M,12,2011-09-15,PT5M,This was yummy. And easy. And pretty! And it t...
3,Chicken Florentine Pasta,"1 pound Penne\n4 whole Boneless, Skinless Chic...",http://thepioneerwoman.com/cooking/2012/04/chi...,http://static.thepioneerwoman.com/cooking/file...,PT20M,10,2012-04-23,PT10M,"I made this for a late lunch Saturday, and it ..."
4,Perfect Iced Coffee,"1 pound Ground Coffee (good, Rich Roast)\n8 qu...",http://thepioneerwoman.com/cooking/2011/06/per...,http://static.thepioneerwoman.com/cooking/file...,PT,24,2011-06-13,PT8H,"Iced coffee is my life. When I wake up, often ..."
...,...,...,...,...,...,...,...,...,...
1037,Golden Potstickers,1/2 cup sunflower oil\n8 green onions / scalli...,http://www.101cookbooks.com/archives/golden-po...,http://www.101cookbooks.com/mt-static/images/f...,PT10M,Makes a big platter of dumplings.,2011-10-05,PT60M,"Potstickers - For my flight to London, I made..."
1038,Gougères,2/3 cup / 160 ml beer / ale OR water\n1/3 cup...,http://www.101cookbooks.com/archives/gougares-...,http://www.101cookbooks.com/mt-static/images/f...,PT30M,,2011-12-17,PT10M,Gougères - I have these little cheese puffs in...
1039,Parmesan Cheese Spread,2 1/2 cups / 5 1/2 oz / 150g finely grated Par...,http://www.101cookbooks.com/archives/parmesan-...,http://www.101cookbooks.com/mt-static/images/f...,,Makes ~1 1/2 cups of spread.,2012-05-09,PT5M,A Parmesan cheese spread made with grated chee...
1040,Mast-o-Khiar Yogurt Dip,"2 medium garlic cloves, peeled\n1/2 teaspoon f...",http://www.101cookbooks.com/archives/mastokhia...,http://www.101cookbooks.com/mt-static/images/f...,,Serves 6-8.,2012-09-08,PT5M,The prettiest dip in my repertoire - my take o...


In [23]:
Recipe.shape

(1042, 9)

In [139]:
Recipe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1042 entries, 0 to 1041
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           1042 non-null   object
 1   ingredients    1042 non-null   object
 2   url            1042 non-null   object
 3   image          1042 non-null   object
 4   cookTime       1042 non-null   object
 5   recipeYield    1042 non-null   object
 6   datePublished  1042 non-null   object
 7   prepTime       1042 non-null   object
 8   description    1042 non-null   object
dtypes: object(9)
memory usage: 73.4+ KB


In [140]:
Recipe.iloc[0]

name                                      Easter Leftover Sandwich
ingredients      12 whole Hard Boiled Eggs\n1/2 cup Mayonnaise\...
url              http://thepioneerwoman.com/cooking/2013/04/eas...
image            http://static.thepioneerwoman.com/cooking/file...
cookTime                                                        PT
recipeYield                                                      8
datePublished                                           2013-04-01
prepTime                                                     PT15M
description      Got leftover Easter eggs?    Got leftover East...
Name: 0, dtype: object

In [141]:
# print(len(Recipe['ingredients'].max()))

In [142]:
Recipe.ingredients.str.len().describe()     

count    1042.000000
mean      358.643954
std       187.330319
min        22.000000
25%       246.250000
50%       338.000000
75%       440.000000
max      3160.000000
Name: ingredients, dtype: float64

In [143]:
Recipe.ingredients.str.len().max()

3160

In [144]:
Recipe.name[np.argmax(Recipe.ingredients.str.len())]

'A Nice Berry Pie'
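
np.argmax returns a position, so using it as a label works here only because the index runs from 0 to n-1. A possible alternative is idxmax(), which returns the index label directly. The sketch below uses a small made-up frame, `toy`, with a non-default index to show the idea.

```python
import pandas as pd

# Hypothetical stand-in for the recipe table, with a non-default index
toy = pd.DataFrame(
    {'name': ['Toast', 'Berry Pie', 'Salad'],
     'ingredients': ['bread', 'berries\nflour\nbutter\nsugar', 'lettuce\noil']},
    index=[10, 20, 30])

# Length of each ingredient string
lengths = toy.ingredients.str.len()

# idxmax() returns the index label of the longest string
longest_label = lengths.idxmax()
longest_name = toy.loc[longest_label, 'name']
print(longest_label, longest_name)
```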

In [145]:
Recipe.description.str.contains('[Bb]reakfast').sum()

11

In [146]:
Recipe.name[Recipe.description.str.contains('[Bb]reakfast')]

24             Sausage-Kale Breakfast Strata
38                     Petite Vanilla Scones
204                 Breakfast Burritos to Go
307              Cinnamon Baked French Toast
377                          Creamed Spinach
900     Lori's Skillet Smashed Potato Recipe
904                 Breakfast Polenta Recipe
907    Warm and Nutty Cinnamon Quinoa Recipe
909               Wheat Berry Breakfast Bowl
915                Toasted Four Grain Cereal
918                  Pomegranate Yogurt Bowl
Name: name, dtype: object

In [147]:
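# Note: [CIci] matches a single character, so this pattern actually searches for
# "innamon" (or "cnnamon"); '[Cc]innamon' would state the intent directly.
Recipe.ingredients.str.contains('[CIci]nnamon').sum()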

78

In [148]:
Recipe.name[Recipe.ingredients.str.contains('[CIci]nnamon')]

11                           Baked French Toast
39              Individual Cherry Almond Crisps
43                         Oatmeal Whoopie Pies
54             Grillin’ Recipe Contest Winners!
69                         I Love Ya, Tomorrow!
                         ...                   
982                          Carrot Cake Recipe
984     Breton Buckwheat Cake with Fleur de Sel
997             Black Sticky Gingerbread Recipe
1008                   Buttermilk Berry Muffins
1014              Spiced Candied Walnuts Recipe
Name: name, Length: 78, dtype: object

In [149]:
Recipe.ingredients.str.contains('[Cc]inamon').sum()

2
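
One way to catch both the correct spelling and the single-n misspelling in a single pass is to make the second n optional in the pattern. This is a sketch on made-up ingredient strings, not a result from the recipe data.

```python
import pandas as pd

# Hypothetical ingredient lines
ingredients = pd.Series(['1 tsp Cinnamon', '2 sticks cinamon',
                         'pinch of nutmeg', 'ground cinnamon'])

# "n?" makes the second n optional, so both spellings match
either = ingredients.str.contains(r'[Cc]inn?amon')
print(either.tolist())
```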

In [None]:
# Exercise: try other misspellings of your own

In [166]:
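# Caution: 'Wine' and '[Ss]ugar' is a Python expression that evaluates to '[Ss]ugar',
# so this counts recipes that mention sugar only, not recipes that mention both.
Recipe.ingredients.str.contains('Wine' and '[Ss]ugar').sum()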

377

In [169]:
# Same caveat: the pattern passed to contains() here is just '[Ss]ugar'
Recipe.name[Recipe.ingredients.str.contains('Wine' and '[Ss]ugar')]

4                   Perfect Iced Coffee
8                       Yum. Doughnuts!
11                   Baked French Toast
12         Yummy Slice-and-Bake Cookies
16                    Mango Margaritas!
                     ...               
1014      Spiced Candied Walnuts Recipe
1017              Animal Cracker Recipe
1020    Sweet &amp; Spicy Pumpkin Seeds
1035                   Oatmeal Crackers
1037                 Golden Potstickers
Name: name, Length: 377, dtype: object
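
In Python, `'Wine' and '[Ss]ugar'` evaluates to the second string, so the two cells above search for sugar alone. To require both ingredients, one option is to build two Boolean masks and combine them with `&`. The sketch below demonstrates this on a small made-up Series, `toy`.

```python
import pandas as pd

# `and` between two non-empty strings returns the second one
pattern = 'Wine' and '[Ss]ugar'
print(pattern)

# Hypothetical ingredient strings
toy = pd.Series(['Red Wine\nSugar', 'White Wine', 'brown sugar', 'Salt'])

# One mask per ingredient, combined element-wise with &
has_wine = toy.str.contains('Wine')
has_sugar = toy.str.contains('[Ss]ugar')
both = has_wine & has_sugar

print(toy.str.contains(pattern).sum())  # counts sugar only
print(both.sum())                       # counts wine AND sugar
```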

In [171]:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

In [173]:
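import re
# Note: the second positional argument of str.contains() is `case`, not `flags`,
# so this call still matches case-sensitively; use flags=re.IGNORECASE to ignore case.
spice_df = pd.DataFrame(dict((spice, Recipe.ingredients.str.contains(spice, re.IGNORECASE))
                             for spice in spice_list))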
spice_df.head()

    salt  pepper  oregano   sage  parsley  rosemary  tarragon  thyme  paprika  cumin
0  False   False    False  False    False     False     False  False    False  False
1  False   False    False  False    False     False     False  False    False  False
2  False   False    False  False    False     False     False  False    False  False
3  False   False    False  False    False     False     False  False    False  False
4  False   False    False  False    False     False     False  False    False  False
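
In pandas, str.contains() takes `case` as its second parameter and `flags` as its third, so a regex flag passed positionally is read as `case`. A sketch of two ways to request case-insensitive matching explicitly is shown below, on a made-up Series named `toy`.

```python
import re
import pandas as pd

# Hypothetical ingredient strings with mixed capitalization
toy = pd.Series(['Salt And Pepper, to taste', '1 tsp salt', '2 cups Flour'])

# Default behavior: case-sensitive
sensitive = toy.str.contains('salt')

# Case-insensitive, passing the regex flag by keyword
by_flag = toy.str.contains('salt', flags=re.IGNORECASE)

# Case-insensitive, using the dedicated `case` parameter
by_case = toy.str.contains('salt', case=False)

print(sensitive.tolist(), by_flag.tolist(), by_case.tolist())
```

With case-insensitive matching, capitalized entries such as "Salt And Pepper" are counted as well, so totals computed on the recipe data would likely differ from the case-sensitive results.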


In [191]:
so1 = spice_df.query('parsley & paprika')
len(so1)

2

In [195]:
Recipe.name[so1.index]

637            Pappardelle with Spiced Butter
1010    Herb Jam with Olives and Lemon Recipe
Name: name, dtype: object
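
The query() call used above is shorthand for Boolean indexing on the two columns. A minimal sketch of the equivalence, on a small made-up frame called `flags_df`:

```python
import pandas as pd

# Hypothetical indicator columns
flags_df = pd.DataFrame({'parsley': [True, False, True],
                         'paprika': [True, True, False]})

# Rows where both columns are True, selected two ways
via_query = flags_df.query('parsley & paprika')
via_mask = flags_df[flags_df.parsley & flags_df.paprika]

print(via_query.index.tolist())
print(via_query.equals(via_mask))
```

Either form returns a frame whose index can be used to look up the matching recipe names.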

In [184]:
spice_df.query('parsley & paprika').max()


salt         True
pepper       True
oregano     False
sage        False
parsley      True
rosemary    False
tarragon    False
thyme       False
paprika      True
cumin        True
dtype: bool

In [177]:
sweets_list = ['Wine', 'Sugar']

In [188]:
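import re
# Note: re.IGNORECASE is passed positionally as `case` here, so matching remains
# case-sensitive; pass flags=re.IGNORECASE to ignore case.
sweets_df = pd.DataFrame(dict((words, Recipe.ingredients.str.contains(words, re.IGNORECASE))
                              for words in sweets_list))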
sweets_df.head()

    Wine  Sugar
0  False  False
1  False  False
2  False  False
3   True  False
4  False   True


In [192]:
so2 = sweets_df.query('Wine & Sugar')
len(so2)

21

In [193]:
Recipe.name[so2.index]

48     Maple Glazed Chicken Kabobs with Sweet Jalapen...
60                          The Best Macaroni Salad Ever
71                                       Spaghetti Sauce
139                              Beef Noodle Salad Bowls
148                 Meatballs with Peppers and Pineapple
167                           Short Ribs in Tomato Sauce
179                                          Greek Salad
214                                 Black Eyed Pea Salsa
238           Cornbread Dressing with Sausage and Apples
239    Three Cheese-Stuffed Shells with Meaty Tomato ...
245                                      Big Steak Salad
286                            Spaghetti &amp; Meatballs
324                                   Chicken Parmigiana
423                        Sushi 101: Perfect Sushi Rice
446                    What I Made for Lunch on Saturday
451                         Linguine with Chicken Thighs
458                                     My Spinach Salad
459                            

In [194]:
so3 = spice_df.query('parsley & oregano')
len(so3)

3

In [196]:
Recipe.name[so3.index]

631    Summer Squash Gratin Recipe
797       Oregano Brussels Sprouts
913      Super-eggy Scrambled Eggs
Name: name, dtype: object