### Vectorized String Operations

#### Introducing Pandas String Operations


In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [7]:
# capitalized words
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]

['Peter', 'Paul', 'Mary', 'Guido']

In [9]:
# insert a missing value; the list comprehension below will then break
data.insert(2, None)

In [10]:
data

['peter', 'Paul', None, 'MARY', 'gUIDO']

In [11]:
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'

Pandas includes features to address both this need for **vectorized string operations** and the need for **correctly handling missing data** via the `str` attribute of *Pandas Series* and *Index* objects containing strings.

In [12]:
names = pd.Series(data)
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

**names.str**:
1. Vectorized string functions for Series and Index.
2. NAs stay NA unless handled otherwise by a particular method.
3. Patterned after Python's string methods, with some inspiration from R's stringr package.

**dir(names.str)**:

'capitalize', 'cat', 'center', 'contains', 'count', 'decode', 'encode', 'endswith', 'extract', 'extractall', 'find', 'findall', 'get', 'get_dummies', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'islower', 'isnumeric', 'isspace', 'istitle', 'isupper', 'join', 'len', 'ljust', 'lower', 'lstrip', 'match', 'normalize', 'pad', 'partition', 'repeat', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'slice', 'slice_replace', 'split', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'wrap', 'zfill'

In [21]:
names.str.capitalize?
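
With `names` defined, the `str` accessor applies the method element-wise and leaves the missing entry missing instead of raising. A minimal sketch reusing the data above; how the missing value is displayed (`None` or `NaN`) may depend on the pandas version, so no output is shown here:

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
names = pd.Series(data)

# Element-wise capitalize; the missing entry is passed through, not an error
capitalized = names.str.capitalize()
print(capitalized)
```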

### Tables of Pandas String Methods

#### Methods similar to Python string methods

* A list of Pandas str methods that mirror Python string methods:
<img src="files/str_methods.png">

In [3]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

In [26]:
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [29]:
monte.str.ljust(width=14, fillchar=' ')

0    Graham Chapman
1    John Cleese   
2    Terry Gilliam 
3    Eric Idle     
4    Terry Jones   
5    Michael Palin 
dtype: object

In [31]:
monte.str.rjust(width=20, fillchar=' ')

0          Graham Chapman
1             John Cleese
2           Terry Gilliam
3               Eric Idle
4             Terry Jones
5           Michael Palin
dtype: object

In [34]:
monte.str.center(width=14, fillchar=" ")

0    Graham Chapman
1     John Cleese  
2    Terry Gilliam 
3      Eric Idle   
4     Terry Jones  
5    Michael Palin 
dtype: object

In [36]:
# Filling left side of strings in the Series/Index with 0.
monte.str.zfill(width=14)

0    Graham Chapman
1    000John Cleese
2    0Terry Gilliam
3    00000Eric Idle
4    000Terry Jones
5    0Michael Palin
dtype: object
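
Zero-filling is more commonly applied to strings of digits, for example to bring identifiers to a fixed width. A small sketch; the `ids` Series is made up for illustration:

```python
import pandas as pd

# Hypothetical identifiers stored as strings of varying length
ids = pd.Series(['7', '42', '123'])

# Pad on the left with '0' until each string is 5 characters wide
padded = ids.str.zfill(5)
print(padded)
```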

In [43]:
monte.str.lstrip().str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

monte.str.find(sub, start=0, end=None)
1. Returns, for each string in the Series/Index, the lowest index at which the substring is fully contained within [start:end].
2. Returns -1 where the substring is not found.

monte.str.rfind(sub, start=0, end=None)
1. Same as ``str.find``, but returns the highest index.

monte.str.rindex(sub, start=0, end=None)
1. Returns, for each string, the highest index at which the substring is fully contained within [start:end].
2. Same as ``str.rfind``, except that it raises a ValueError instead of returning -1 when the substring is not found.
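
The difference between the two failure modes can be sketched as follows. The substring `'Cl'` is chosen for illustration because it occurs in only one name:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# find/rfind report a missing substring as -1
positions = monte.str.find('Cl')

# index/rindex raise a ValueError if any element lacks the substring
try:
    monte.str.rindex('Cl')
    raised = False
except ValueError:
    raised = True

print(positions)
print('rindex raised ValueError:', raised)
```

When some elements may not contain the substring, `find`/`rfind` are usually the safer choice, since one miss is enough to make `index`/`rindex` fail for the whole Series.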

In [47]:
monte.str.find(' C')

0    6
1    4
2   -1
3   -1
4   -1
5   -1
dtype: int64

In [55]:
pd.DataFrame({'Lowest': monte.str.find('e'),
              'Highest': monte.str.rfind('e')})

Unnamed: 0,Highest,Lowest
0,-1,-1
1,10,7
2,1,1
3,8,8
4,9,1
5,5,5


In [57]:
monte.str.swapcase()

0    gRAHAM cHAPMAN
1       jOHN cLEESE
2     tERRY gILLIAM
3         eRIC iDLE
4       tERRY jONES
5     mICHAEL pALIN
dtype: object

Signature: monte.str.translate(table, deletechars=None)

Docstring:
Map all characters in the string through the given mapping table.
Equivalent to standard :meth:`str.translate`. Note that the optional
argument deletechars is only valid if you are using python 2. For python 3,
character deletion should be specified via the table argument.
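
In Python 3, one way to express deletion through the table is the third argument of the built-in `str.maketrans`. A minimal sketch on the `monte` Series; the choice of lowercase vowels is only illustrative:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# The third argument of str.maketrans lists characters to delete
drop_vowels = str.maketrans('', '', 'aeiou')

no_vowels = monte.str.translate(drop_vowels)
print(no_vowels)
```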

In [80]:
a = list('abcdefg')
table_a = {}
for i in range(len(a)):
    table_a[a[i]]=str(i)+a[i]
table_a

{'a': '0a', 'b': '1b', 'c': '2c', 'd': '3d', 'e': '4e', 'f': '5f', 'g': '6g'}

In [81]:
b = pd.Series(a)
b

0    a
1    b
2    c
3    d
4    e
5    f
6    g
dtype: object

In [83]:
# translate expects a table built with str.maketrans, not the dict's values
b.str.translate(str.maketrans(table_a))

0    0a
1    1b
2    2c
3    3d
4    4e
5    5f
6    6g
dtype: object

In [86]:
monte.str.startswith('Gr')

0     True
1    False
2    False
3    False
4    False
5    False
dtype: bool

In [87]:
monte.str.endswith('se')

0    False
1     True
2    False
3    False
4    False
5    False
dtype: bool

**Check whether all characters in each string in the Series/Index are:**

* `isalnum`: alphanumeric
* `isalpha`: alphabetic
* `isdigit`: digits
* `isspace`: whitespace
* `istitle`: titlecase
* `islower`: lowercase
* `isupper`: uppercase
* `isnumeric`: numeric
* `isdecimal`: decimal
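
A few of these checks side by side, as a small sketch. The `samples` Series is made up for illustration:

```python
import pandas as pd

# Hypothetical sample strings covering a few character classes
samples = pd.Series(['abc', '123', 'abc123', '   ', 'Title Case'])

# Each is* method returns a boolean Series, one value per string
checks = pd.DataFrame({
    'value': samples,
    'isalpha': samples.str.isalpha(),
    'isdigit': samples.str.isdigit(),
    'isalnum': samples.str.isalnum(),
    'isspace': samples.str.isspace(),
    'istitle': samples.str.istitle(),
})
print(checks)
```

Note that `'Title Case'` fails `isalpha` and `isalnum` because of the space: every character in the string must satisfy the check.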

In [120]:
# with pat=None (the default), split on whitespace
monte.str.split(pat=None, expand=True)

         0        1
0   Graham  Chapman
1     John   Cleese
2    Terry  Gilliam
3     Eric     Idle
4    Terry    Jones
5  Michael    Palin


In [119]:
# like split, but starting at the end of the string and working to the front
monte.str.rsplit(pat=None, expand=True)

         0        1
0   Graham  Chapman
1     John   Cleese
2    Terry  Gilliam
3     Eric     Idle
4    Terry    Jones
5  Michael    Palin
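
Without a limit on the number of splits, `split` and `rsplit` return the same pieces; the direction only matters once `n` is set. A small sketch, using a made-up Series `codes`:

```python
import pandas as pd

# Hypothetical underscore-separated codes
codes = pd.Series(['a_b_c', 'x_y_z'])

# n=1 allows a single split: split cuts at the first separator,
# rsplit cuts at the last one
from_left = codes.str.split('_', n=1, expand=True)
from_right = codes.str.rsplit('_', n=1, expand=True)

print(from_left)
print(from_right)
```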


In [124]:
monte.str.partition('z')

# Split the string at the first occurrence of `sep`, 
# and return 3 elements containing the part before the separator, 
# the separator itself, and the part after the separator.
# If the separator is not found, 
# return 3 elements containing the string itself, 
# followed by two empty strings.

                0 1 2
0  Graham Chapman
1     John Cleese
2   Terry Gilliam
3       Eric Idle
4     Terry Jones
5   Michael Palin
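
For contrast, a sketch of `partition` with a separator that does occur in every name (a single space). The column labels are assigned here only for readability:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Split at the first space: part before, the separator, part after
parts = monte.str.partition(' ')
parts.columns = ['first', 'sep', 'last']  # illustrative labels
print(parts)
```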


In [126]:
monte.str.rpartition('e')
# Split the string at the last occurrence of `sep`

            0  1               2
0                 Graham Chapman
1  John Clees  e
2           T  e     rry Gilliam
3    Eric Idl  e
4   Terry Jon  e               s
5       Micha  e         l Palin


#### Methods using regular expressions
<img src="files/re_module.png">

In [128]:
# extract the first name from each by asking for a contiguous group 
# of characters at the beginning of each element
monte.str.extract('([A-Za-z]+)', expand=True)

         0
0   Graham
1     John
2    Terry
3     Eric
4    Terry
5  Michael


In [4]:
# finding all names that start and end with a consonant, 
# making use of the start-of-string (^) 
# and end-of-string ($) regular expression characters
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
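
Two further regex-based methods, sketched on the same data: `extract` with named groups, which turns the group names into column labels, and `contains`, which returns a boolean mask. The group names `first` and `last` are chosen here for illustration:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Named groups become the column names of the resulting DataFrame
name_parts = monte.str.extract(r'^(?P<first>\w+)\s+(?P<last>\w+)$')

# contains returns True where the pattern matches anywhere in the string
is_terry = monte.str.contains(r'^Terry')

print(name_parts)
print(monte[is_terry])
```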

#### Miscellaneous methods
<img src="files/pd_str_methods.PNG" width=400>


#### Vectorized item access and slicing
* monte.str.get(i)
* monte.str.slice(start=None, stop=None, step=None)
    * get() and slice() methods also let you access elements of arrays returned by split().


In [25]:
monte.str[0:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

In [26]:
monte.str.slice(0,3,1)

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

In [27]:
# extract the last name of each entry
monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object

#### Indicator variables
* `monte.str.get_dummies()` is useful when your data has a column containing some sort of coded indicator.
* For example:
    * A = "born in America"
    * B = "born in the United Kingdom"
    * C = "likes cheese"
    * D = "likes spam"

In [30]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C',
                                    'B|C|D']})
full_monte

    info            name
0  B|C|D  Graham Chapman
1    B|D     John Cleese
2    A|C   Terry Gilliam
3    B|D       Eric Idle
4    B|C     Terry Jones
5  B|C|D   Michael Palin


In [34]:
full_monte['info'].str.get_dummies("|")

   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1
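
The indicator columns can then be joined back onto the original frame and used for filtering. A minimal sketch, assuming the same `full_monte` data as above; the names `combined` and `spam_fans` are introduced here for illustration:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C',
                                    'B|C|D']})

# One 0/1 column per code found in the '|'-separated strings
dummies = full_monte['info'].str.get_dummies('|')

# Attach the indicator columns to the original rows (aligned on the index)
combined = full_monte.join(dummies)

# Select everyone whose info includes code D ("likes spam")
spam_fans = combined.loc[combined['D'] == 1, 'name'].tolist()
print(spam_fans)
```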


### Example: Recipe Database

In [35]:
!curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz
!gunzip recipeitems-latest.json.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    20  100    20    0     0     42      0 --:--:-- --:--:-- --:--:--    42
'gunzip' is not recognized as an internal or external command,
operable program or batch file.
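
The `gunzip` error above suggests the command is not available in this shell, as is typical on Windows. One portable alternative is to decompress with Python's standard `gzip` and `shutil` modules. This is a sketch only: the helper name `gunzip_file` is hypothetical, and it assumes the archive was actually downloaded, which the 20-byte transfer above leaves in doubt.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src, dst=None):
    """Decompress a .gz file (hypothetical stand-in for the gunzip command).

    If dst is not given, the output path is src without its final suffix,
    e.g. 'data.json.gz' -> 'data.json'. Unlike gunzip, the original
    archive is left in place.
    """
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix('')
    with gzip.open(src, 'rb') as f_in, open(dst, 'wb') as f_out:
        # Stream the decompressed bytes to the output file in chunks
        shutil.copyfileobj(f_in, f_out)
    return dst


# Usage, assuming the download succeeded:
# gunzip_file('recipeitems-latest.json.gz')
```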
