# Vectorized string functions in pandas

Cleaning up a messy data set for analysis often requires a lot of string munging and regularization. To complicate matters, a column containing strings will sometimes have missing data:

In [1]:
import pandas as pd
import numpy as np
import re
from pandas import DataFrame, Series

In [3]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [5]:
data = Series(data)

In [6]:
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

You can apply string and regular expression methods to each value with data.map (passing a lambda or other function), but this fails on NA values. To cope with this, Series has concise string methods that skip NA values. These are accessed through the Series str attribute; for example, we could check whether each email address contains 'gmail' with str.contains:

In [7]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
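To make the contrast with data.map concrete, here is a minimal sketch assuming the same four-entry Series as above. The names map_failed, has_gmail, mask, and gmail_only are illustrative only. The na=False argument is an addition not used in the text above; it treats missing entries as non-matches so the result can be used directly as a boolean mask.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# A plain Python membership test raises on the missing value,
# because NaN is a float and does not support the "in" operator.
try:
    data.map(lambda s: 'gmail' in s)
    map_failed = False
except TypeError:
    map_failed = True

# The vectorized method does not raise on the missing entry.
has_gmail = data.str.contains('gmail')

# na=False counts missing entries as non-matches, so the result
# is a clean boolean mask suitable for filtering.
mask = data.str.contains('gmail', na=False)
gmail_only = data[mask]
```

Filtering with the mask keeps only the rows for Steve and Rob; the entry with the missing address is dropped rather than causing an error.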

Regular expressions can be used, too, along with any re options like IGNORECASE:

In [8]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [10]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [11]:
matches = data.str.match(pattern, flags=re.IGNORECASE)

matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object
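If the goal is to pull the pieces of each address into separate columns, str.extract is one option. The sketch below assumes the same data, with the literal dot in the pattern escaped by a single backslash (`\.`) inside the raw string. The column labels username, domain, and suffix are hypothetical names chosen for illustration. It also shows that the str attribute supports slicing, which applies to each element.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# Three capture groups: user name, domain name, domain suffix.
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# str.extract returns a DataFrame with one column per capture group;
# rows that are missing or do not match are filled with NaN.
parts = data.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ['username', 'domain', 'suffix']  # hypothetical labels

# Slicing through .str applies to every element and skips NA values.
first_five = data.str[:5]
```

Compared with findall, which returns a list of tuples per element, extract gives a tabular result that is easier to join back onto the original data.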

![Vectorized string methods](../../Pictures/Vectorized%20string%20methods.png)