# Vectorized String Operations

In [1]:
import numpy as np
import pandas as pd

Pandas addresses two needs at once, vectorized string operations and correct handling of missing data, through the **str** attribute of Pandas Series and Index objects containing strings.

In [3]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
names = pd.Series(data)
names.str.capitalize()

0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object
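To see why the missing value matters, the sketch below compares a plain list comprehension with the **str** accessor on the same data. The variable names `comprehension_failed` and `capitalized` are introduced here for illustration only.

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# A plain Python loop breaks on the missing entry,
# because None has no .capitalize() method.
try:
    [s.capitalize() for s in data]
    comprehension_failed = False
except AttributeError:
    comprehension_failed = True

# The .str accessor applies the method element by element
# and passes missing values through untouched.
names = pd.Series(data)
capitalized = names.str.capitalize()

print(comprehension_failed)
print(capitalized)
```

The accessor leaves the missing entry as a missing value in the result instead of raising.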

**Methods similar to Python string methods**

![](3.jpg)

In [4]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam', 'Eric Idle', 'Terry Jones', 'Michael Palin'])

In [5]:
monte.str.lower()
monte.str.len()
monte.str.startswith('T')
monte.str.split()
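A notebook cell displays only its last expression, so the four calls above show just the result of `split()`. As a minimal sketch, each result can be bound to a name (the names below are illustrative) and inspected separately:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

lowered = monte.str.lower()              # Series of lowercase strings
lengths = monte.str.len()                # Series of integers
starts_with_t = monte.str.startswith('T')  # Series of booleans
parts = monte.str.split()                # Series of lists, split on whitespace

for label, result in [('lower', lowered), ('len', lengths),
                      ('startswith', starts_with_t), ('split', parts)]:
    print(label)
    print(result, end='\n\n')
```

Note that the return type depends on the method: strings, numbers, booleans, or lists.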

**Methods using regular expressions**

![](4.jpg)

In [7]:
monte.str.extract('([A-Za-z]+)', expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

In [10]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
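The pattern `^[^AEIOU].*[^aeiou]$` matches names that do not start with an uppercase vowel and do not end with a lowercase vowel. The sketch below reuses it with `str.contains` to get a boolean mask, and adds a `str.replace` call with capture groups; the variable names are illustrative assumptions.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

pattern = r'^[^AEIOU].*[^aeiou]$'

# Boolean mask: True where the whole name matches the pattern
mask = monte.str.contains(pattern)
consonant_names = monte[mask]

# Capture groups can be reused in the replacement string;
# regex=True is passed explicitly so the pattern is not treated literally.
last_first = monte.str.replace(r'(\w+) (\w+)', r'\2, \1', regex=True)

print(consonant_names)
print(last_first)
```

Using the mask to index `monte` keeps only the rows for which `findall` returned a non-empty list above.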

**Miscellaneous methods**

![](5.jpg)

In [None]:
monte.str[0:3]
monte.str.slice(0, 3)

In [15]:
monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
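Slicing and `get()` can be chained with other **str** methods. As a small sketch, the following checks that the two slicing forms agree and builds initials from the split names; `initials` is a name made up for this example.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Index syntax and .slice() are two spellings of the same operation
prefix_a = monte.str[0:3]
prefix_b = monte.str.slice(0, 3)
same = prefix_a.equals(prefix_b)

# split() yields a list per row; get() picks one element of each list,
# and str[0] then takes the first character of that element.
words = monte.str.split()
initials = words.str.get(0).str[0] + words.str.get(-1).str[0]

print(same)
print(initials)
```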

**Indicator variables**  
Another method that requires a bit of extra explanation is the **get_dummies()** method. This is useful when your data has a column containing some sort of coded indicator.

In [17]:
full_monte = pd.DataFrame({'name': monte, 'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D']})

In [18]:
full_monte['info'].str.get_dummies('|')

   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1
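Once the codes are split into indicator columns, they can be used like any other numeric or boolean columns. The sketch below counts codes per person and selects everyone carrying code `C`; `codes_per_person` and `has_c` are illustrative names, and the `astype` calls are there only to keep the example independent of the indicator dtype.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D']})

# One column per distinct code, 1 where the code is present
dummies = full_monte['info'].str.get_dummies('|')

# Row sums give the number of codes attached to each person
codes_per_person = dummies.astype(int).sum(axis=1)

# An indicator column doubles as a row filter
has_c = full_monte.loc[dummies['C'].astype(bool), 'name']

print(codes_per_person)
print(has_c)
```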


**Example: Recipe Database**

In [None]:
recipes.ingredients.str.len().describe()
recipes.name[np.argmax(recipes.ingredients.str.len())]
recipes.description.str.contains('[Bb]reakfast').sum()

In [None]:
import re
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley', 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
# Pass the flag by keyword: the second positional argument of str.contains is `case`, not `flags`
spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, flags=re.IGNORECASE))
                             for spice in spice_list))

In [None]:
selection = spice_df.query('parsley & paprika & tarragon')
recipes.name[selection.index]
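The cells above assume a `recipes` DataFrame that is not loaded in this section. To try the same steps end to end, here is a sketch on a tiny, made-up stand-in; the three rows and the shortened spice list are hypothetical and do not come from the real recipe database.

```python
import re
import pandas as pd

# Hypothetical stand-in for the real recipe data
recipes = pd.DataFrame({
    'name': ['Herb chicken', 'Plain toast', 'Spiced stew'],
    'ingredients': ['Parsley, Paprika, tarragon, chicken',
                    'bread, butter',
                    'paprika, cumin, beef, salt'],
    'description': ['A dinner dish', 'A quick Breakfast', 'Slow cooked'],
})

# Recipe with the longest ingredient string, and count of breakfast recipes
longest = recipes.name[recipes.ingredients.str.len().idxmax()]
n_breakfast = recipes.description.str.contains('[Bb]reakfast').sum()

# One boolean column per spice, matched case-insensitively
spice_list = ['salt', 'parsley', 'paprika', 'tarragon', 'cumin']
spice_df = pd.DataFrame({
    spice: recipes.ingredients.str.contains(spice, flags=re.IGNORECASE)
    for spice in spice_list
})

# Rows where all three spices appear
selection = spice_df.query('parsley & paprika & tarragon')
matching = recipes.name[selection.index]

print(longest, n_breakfast)
print(matching)
```

Because `spice_df` shares its index with `recipes`, the index of the filtered rows can be used directly to look up recipe names.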