# Notes for Chapter 7:

# Data Cleaning and Preparation

***

## 7.1 Handling Missing Data

Missing data is common in real-world datasets, and pandas is designed to make handling it as painless as possible.

The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value that can be easily detected:

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
s_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [4]:
s_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [5]:
s_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.

The built-in Python value None is also treated as NA (not available) in object arrays:

In [6]:
s_data[0] = None

In [7]:
s_data

0         None
1    artichoke
2          NaN
3      avocado
dtype: object

In [8]:
s_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
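To analyze the missing data itself, as suggested earlier in this section, a reasonable first step is to count missing entries per column. The sketch below is illustrative only: the `survey` DataFrame and its column names are hypothetical and do not come from the chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; the column names are illustrative only.
survey = pd.DataFrame({
    'age': [34, np.nan, 29, 41],
    'income': [np.nan, np.nan, 52000.0, 61000.0],
    'city': ['Lyon', None, 'Oslo', 'Lima'],
})

# isnull() marks both NaN and None; summing the booleans counts them.
missing_per_column = survey.isnull().sum()

# The mean of the boolean mask is the fraction of missing entries.
missing_fraction = survey.isnull().mean()

# Rows in which every field is missing may point to a collection problem.
fully_missing_rows = survey[survey.isnull().all(axis=1)]
```

A column with a much higher missing fraction than the others is usually worth investigating before any rows are dropped.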

### Filtering Out Missing Data

The __dropna__ method filters out missing data. On a Series, it returns the Series with only the non-null data and index values:

In [9]:
from numpy import nan as NA

In [10]:
data = pd.Series([1, 2, NA, NA, 5])

In [11]:
data.dropna()

0    1.0
1    2.0
4    5.0
dtype: float64

This is equivalent to:

In [12]:
data[data.notnull()]

0    1.0
1    2.0
4    5.0
dtype: float64

With a DataFrame, we can drop rows or columns that are all NA, or only those containing any NA. By default, __dropna__ drops any row containing a missing value:

In [13]:
df = pd.DataFrame([[2, 3, 7], [NA, NA, NA], [1, NA, NA], [NA, 6.2, 1.3]])

In [14]:
df

Unnamed: 0,0,1,2
0,2.0,3.0,7.0
1,,,
2,1.0,,
3,,6.2,1.3


In [15]:
cleaned = df.dropna()

In [16]:
cleaned

Unnamed: 0,0,1,2
0,2.0,3.0,7.0


Passing how='all' drops only the rows that are all NA:

In [17]:
df.dropna(how='all')

Unnamed: 0,0,1,2
0,2.0,3.0,7.0
2,1.0,,
3,,6.2,1.3


To drop columns in the same way, pass axis=1:

In [18]:
df[4] = NA

In [19]:
df

Unnamed: 0,0,1,2,4
0,2.0,3.0,7.0,
1,,,,
2,1.0,,,
3,,6.2,1.3,


In [20]:
df.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,2.0,3.0,7.0
1,,,
2,1.0,,
3,,6.2,1.3


In [21]:
df.dropna(axis=1)

0
1
2
3


In [22]:
df = pd.DataFrame(np.random.randn(7, 3))

In [23]:
df

Unnamed: 0,0,1,2
0,-0.81643,1.787025,-0.459899
1,-0.515077,-1.393215,0.002064
2,0.323685,-0.060338,0.518014
3,-1.513161,-1.857259,1.90921
4,-0.380563,1.190534,-1.503102
5,-0.075108,0.743296,-0.594775
6,0.479865,-0.662678,0.992333


In [24]:
df.iloc[0, 0] = NA

In [25]:
df.iloc[1, 1:] = NA

In [26]:
df.iloc[2] = NA

In [27]:
df

Unnamed: 0,0,1,2
0,,1.787025,-0.459899
1,-0.515077,,
2,,,
3,-1.513161,-1.857259,1.90921
4,-0.380563,1.190534,-1.503102
5,-0.075108,0.743296,-0.594775
6,0.479865,-0.662678,0.992333


In [28]:
df.dropna(thresh=1)

Unnamed: 0,0,1,2
0,,1.787025,-0.459899
1,-0.515077,,
3,-1.513161,-1.857259,1.90921
4,-0.380563,1.190534,-1.503102
5,-0.075108,0.743296,-0.594775
6,0.479865,-0.662678,0.992333


__thresh__ sets the minimum number of non-null values a row must contain in order to be kept.

In the case above, thresh=1 means that a row needs at least one non-null value; otherwise the row is dropped. Here only row 2, which is entirely NA, is removed.
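A small sketch may make the threshold rule easier to see. The `frame` data below is made up for illustration, and the `subset` argument of `dropna` is an addition not covered in these notes; it restricts the null check to the listed columns.

```python
import numpy as np
import pandas as pd

# Made-up data: rows with 1, 2, 0, 1, and 3 non-null values respectively.
frame = pd.DataFrame([[1.0, np.nan, np.nan],
                      [2.0, np.nan, 3.0],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 5.0, np.nan],
                      [4.0, 5.0, 6.0]])

# Keep only rows with at least two non-null values.
at_least_two = frame.dropna(thresh=2)

# Only look at column 1 when deciding which rows to drop.
col1_present = frame.dropna(subset=[1])
```

With this data, `thresh=2` keeps rows 1 and 4, while `subset=[1]` keeps rows 3 and 4.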

### Filling in Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), we may want to fill in the “holes” in any number of ways. For most purposes, the __fillna__ method is the workhorse function to use. Calling __fillna__ with a constant replaces missing values with that value:

In [29]:
df

Unnamed: 0,0,1,2
0,,1.787025,-0.459899
1,-0.515077,,
2,,,
3,-1.513161,-1.857259,1.90921
4,-0.380563,1.190534,-1.503102
5,-0.075108,0.743296,-0.594775
6,0.479865,-0.662678,0.992333


In [30]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.0,1.787025,-0.459899
1,-0.515077,0.0,0.0
2,0.0,0.0,0.0
3,-1.513161,-1.857259,1.90921
4,-0.380563,1.190534,-1.503102
5,-0.075108,0.743296,-0.594775
6,0.479865,-0.662678,0.992333


Calling __fillna__ with a dict lets us use a different fill value for each column:

In [31]:
df.fillna({0: 1, 1: 9, 2: 0})

Unnamed: 0,0,1,2
0,1.0,1.787025,-0.459899
1,-0.515077,9.0,0.0
2,1.0,9.0,0.0
3,-1.513161,-1.857259,1.90921
4,-0.380563,1.190534,-1.503102
5,-0.075108,0.743296,-0.594775
6,0.479865,-0.662678,0.992333


__fillna__ returns a new object, but we can modify the existing object in-place:

In [32]:
_ = df.fillna(0, inplace=True)

In [33]:
df

Unnamed: 0,0,1,2
0,0.0,1.787025,-0.459899
1,-0.515077,0.0,0.0
2,0.0,0.0,0.0
3,-1.513161,-1.857259,1.90921
4,-0.380563,1.190534,-1.503102
5,-0.075108,0.743296,-0.594775
6,0.479865,-0.662678,0.992333


In [34]:
df = pd.DataFrame(np.random.randn(6, 3))

In [35]:
df.iloc[2:, 1] = NA

In [36]:
df.iloc[4:, 2] = NA

In [37]:
df

Unnamed: 0,0,1,2
0,0.442452,-0.081845,0.519111
1,-0.872861,0.831903,0.959329
2,-0.087882,,0.06045
3,1.004507,,0.59033
4,-1.127586,,
5,-0.999492,,


In [38]:
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,0.442452,-0.081845,0.519111
1,-0.872861,0.831903,0.959329
2,-0.087882,0.831903,0.06045
3,1.004507,0.831903,0.59033
4,-1.127586,0.831903,0.59033
5,-0.999492,0.831903,0.59033


In [39]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,0.442452,-0.081845,0.519111
1,-0.872861,0.831903,0.959329
2,-0.087882,0.831903,0.06045
3,1.004507,0.831903,0.59033
4,-1.127586,,0.59033
5,-0.999492,,0.59033
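Cells 38 and 39 forward-fill missing values, optionally capping the number of consecutive fills with limit. In newer pandas releases the `method` argument of `fillna` is deprecated, so the sketch below uses the dedicated `ffill` and `bfill` methods instead. The `gaps` data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented data with gaps of different lengths.
gaps = pd.DataFrame({'a': [1.0, np.nan, np.nan, np.nan],
                     'b': [np.nan, 2.0, np.nan, 4.0]})

# Propagate the last valid observation forward down each column.
filled = gaps.ffill()

# Fill at most one consecutive missing value per gap.
filled_limited = gaps.ffill(limit=1)

# bfill() goes the other way, pulling the next valid value backward.
back_filled = gaps.bfill()
```

Note that a leading missing value (row 0 of column `b`) has nothing before it to carry forward, so forward filling leaves it as NaN.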


With fillna we can do lots of other things with a little creativity. For example, we might pass the mean or median value of a Series:

In [40]:
data = pd.Series([1., NA, 3.5, NA, 7])

In [41]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
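The same idea extends to a DataFrame. As a rough sketch with made-up data, passing the Series of column means to `fillna` fills each column with its own mean, because the fill values are aligned on column labels:

```python
import numpy as np
import pandas as pd

# Made-up scores with one gap per column.
scores = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                       'y': [np.nan, 10.0, 20.0]})

# scores.mean() is a Series indexed by column name, so fillna aligns
# each fill value with its own column.
filled = scores.fillna(scores.mean())
```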

***

## 7.2 Data Transformation

### Removing Duplicates

Duplicate rows may appear in a DataFrame for any number of reasons. The __duplicated__ method returns a boolean Series indicating whether each row is a duplicate (that is, identical to a row seen earlier) or not:

In [42]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

In [43]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Relatedly, __drop_duplicates__ returns a DataFrame where the duplicated array is False:

In [44]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


By default, these methods consider all of the columns; alternatively, we can specify a subset of columns to use when detecting duplicates:

In [45]:
data['v1'] = range(7)

In [46]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


__duplicated__ and __drop_duplicates__ by default keep the first observed value combination. Passing keep='last' will return the last one:

In [47]:
data.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
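Besides 'first' and 'last', the keep option also accepts False. As a brief sketch using the same k1/k2 data as above, keep=False treats every member of a duplicated group as a duplicate, which helps when we want to inspect or discard all repeated rows:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# keep=False flags every member of a duplicated group, not just the repeats.
all_dupes = data.duplicated(keep=False)

# Dropping with keep=False removes every row that has a twin.
unique_only = data.drop_duplicates(keep=False)
```

Here rows 5 and 6 are both flagged, and neither survives in `unique_only`.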


### Transforming Data Using a Function or Mapping

For many datasets, we may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:

In [48]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [49]:
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose we wanted to add a column indicating the type of animal that each food came from. Let's write down a mapping of each distinct meat type to the kind of animal it came from:

In [50]:
meat_to_animal={
 'bacon': 'pig',
 'pulled pork': 'pig',
 'pastrami': 'cow',
 'corned beef': 'cow',
 'honey ham': 'pig',
 'nova lox': 'salmon'
}

The __map__ method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and others are not. Thus, we need to convert each value to lowercase using the __str.lower__ Series method:

In [51]:
lowercased = data['food'].str.lower()

In [52]:
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [53]:
data['animal'] = lowercased.map(meat_to_animal)

In [54]:
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


Alternatively, we could have passed a function that does all the work:

In [55]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
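The two approaches differ when a value has no entry in the mapping. The following sketch uses a shortened mapping and a deliberately unmapped food ('tofu', which is not in the chapter's data) to show the difference. Mapping with the dict yields NaN, whereas a function that indexes the dict directly would raise KeyError, so the function version here falls back on `dict.get` with a default:

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}
foods = pd.Series(['Bacon', 'pastrami', 'tofu'])  # 'tofu' is deliberately unmapped

# Passing the dict: values without a key become NaN.
via_dict = foods.str.lower().map(meat_to_animal)

# meat_to_animal[x.lower()] would raise KeyError on 'tofu';
# dict.get returns a default instead.
via_func = foods.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))
```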

### Replacing Values

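Filling in missing data with __fillna__ is a special case of more general value replacement. The __replace__ method offers a simple and flexible way to substitute one value for another: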

In [56]:
data = pd.Series(np.arange(5.))

In [57]:
data

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

In [58]:
data.replace(0, np.nan)

0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

As shown, __replace__ returns a new Series (unless we pass inplace=True).

To replace multiple values at once, pass a list of the values to replace, followed by the substitute value:

In [59]:
data.replace([0, 4], np.nan)

0    NaN
1    1.0
2    2.0
3    3.0
4    NaN
dtype: float64

To use a different replacement for each value, pass a list of substitutes:

In [60]:
data.replace([0, 4], [np.nan, 999])

0      NaN
1      1.0
2      2.0
3      3.0
4    999.0
dtype: float64

The argument passed can also be a dict:

In [61]:
data.replace({0: np.nan, 4: 999})

0      NaN
1      1.0
2      2.0
3      3.0
4    999.0
dtype: float64

### Renaming Axis Indexes

Like values in a Series, axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects. We can also modify the axes in-place without creating a new data structure. Here’s a simple example:

In [62]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

In [63]:
transform = lambda x: x[:4].upper()

In [64]:
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

Like a Series, the axis indexes have a __map__ method.

We can assign the result to index, modifying the DataFrame in-place:

In [65]:
data.index = data.index.map(transform)

In [66]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


If we want to create a transformed version of a dataset without modifying the original, a useful method is __rename__:

In [67]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


Notably, __rename__ can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [68]:
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


If we want to modify the dataset in-place rather than creating a new data structure, we can pass inplace=True:

In [69]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [70]:
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


### Discretization and Binning

Continuous data is often discretized or otherwise separated into “bins” for analysis. For example, suppose we have data about a group of people in a study, and we want to group them into discrete age buckets:

In [71]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [72]:
bins = [18, 25, 35, 60, 100]

In [73]:
cats = pd.cut(ages, bins)

In [74]:
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special __Categorical__ object. The output we see describes the bins computed by __pandas.cut__. We can treat it like an array of strings indicating the bin name.

Internally it contains a __categories__ array specifying the distinct category names along with a labeling for the __ages__ data in the __codes__ attribute:

In [75]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [76]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [77]:
pd.value_counts(cats)

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64

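Note that pd.value_counts(cats) gives the bin counts for the result of pandas.cut. In recent pandas releases the top-level pd.value_counts function is deprecated; pd.Series(cats).value_counts() gives the same counts.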

Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while the square bracket means it is closed (inclusive). We can change which side is closed by passing right=False:

In [78]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
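Cell 78 also shifts the bin edges to 26, 36, and 61 so that each age stays in the same group once the left side becomes the closed one. The effect of right=False on a boundary value is easiest to see with the edges held fixed. The following is a minimal sketch with two made-up ages:

```python
import pandas as pd

ages = [25, 26]
bins = [18, 25, 35]

# Default right=True: intervals are (18, 25] and (25, 35], so 25 falls in the first.
right_closed = pd.cut(ages, bins)

# right=False: intervals are [18, 25) and [25, 35), so 25 falls in the second.
left_closed = pd.cut(ages, bins, right=False)
```

Comparing `right_closed.codes` with `left_closed.codes` shows the age 25 moving from bin 0 to bin 1.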

We can also pass our own bin names by passing a list or array to the labels option:

In [79]:
group_names = ['Youth', 'Adult', 'MiddleAged', 'Senior']

In [80]:
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'Adult', 'Youth', ..., 'Adult', 'Senior', 'MiddleAged', 'MiddleAged', 'Adult']
Length: 12
Categories (4, object): ['Youth' < 'Adult' < 'MiddleAged' < 'Senior']

### Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

In [81]:
data = pd.DataFrame(np.random.randn(1000, 4))

In [82]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.001665,0.021206,-0.024954,0.043748
std,0.993666,1.004671,1.018453,0.995099
min,-2.667086,-3.089656,-2.722746,-3.085493
25%,-0.662001,-0.672623,-0.759164,-0.63512
50%,-0.016551,0.048869,-0.032564,0.062692
75%,0.668197,0.689148,0.716571,0.648805
max,3.757898,2.635587,3.250579,3.182067


Suppose we want to find the values in one of the columns that exceed 3 in absolute value:

In [83]:
col = data[2]

In [84]:
col[np.abs(col) > 3]

3    3.250579
Name: 2, dtype: float64

To select all rows having a value exceeding 3 or –3, we can use the __any__ method on a boolean DataFrame:

In [85]:
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
3,-1.403057,0.165347,3.250579,0.750464
253,3.066115,-1.898957,0.795868,0.207256
294,-1.36256,-3.089656,2.343057,0.841212
315,-0.042172,-0.310214,-0.41351,-3.085493
380,0.972891,-3.015682,0.422086,-1.526465
431,-0.75626,-0.342495,0.856734,3.182067
969,3.757898,0.544772,0.475133,-1.035746


Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3:

In [86]:
data[np.abs(data) > 3] = np.sign(data) * 3

In [87]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.000841,0.021312,-0.025205,0.043651
std,0.990881,1.00435,1.017677,0.994276
min,-2.667086,-3.0,-2.722746,-3.0
25%,-0.662001,-0.672623,-0.759164,-0.63512
50%,-0.016551,0.048869,-0.032564,0.062692
75%,0.668197,0.689148,0.716571,0.648805
max,3.0,2.635587,3.0,3.0


The statement __np.sign(data)__ produces 1 and –1 values based on whether the values in data are positive or negative:

In [88]:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,1.0,-1.0,-1.0,1.0
1,-1.0,1.0,-1.0,-1.0
2,-1.0,-1.0,1.0,1.0
3,-1.0,1.0,1.0,1.0
4,1.0,1.0,-1.0,1.0
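As an alternative that these notes do not cover, the `clip` method bounds every value to a given range in one call. The sketch below regenerates random data with an arbitrary seed, so its numbers will differ from the tables above, and checks that clipping matches the sign-based capping:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)  # arbitrary seed, for repeatability only
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Approach from the notes: overwrite out-of-range entries with +/-3.
capped = data.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# Alternative (not in the notes): clip bounds every value to [-3, 3].
clipped = data.clip(lower=-3, upper=3)
```

Both approaches leave in-range values untouched; `clip` avoids building the boolean mask by hand.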


### Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the __numpy.random.permutation__ function. Calling __permutation__ with the length of the axis we want to permute produces an array of integers indicating the new ordering:

In [89]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

In [90]:
sampler = np.random.permutation(5)

In [91]:
sampler

array([4, 0, 2, 3, 1])

That array can then be used in iloc-based indexing or the equivalent __take__ function:

In [92]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [93]:
df.take(sampler)

Unnamed: 0,0,1,2,3
4,16,17,18,19
0,0,1,2,3
2,8,9,10,11
3,12,13,14,15
1,4,5,6,7


To select a random subset without replacement, we can use the __sample__ method on Series and DataFrame:

In [94]:
df.sample(n=2)

Unnamed: 0,0,1,2,3
4,16,17,18,19
1,4,5,6,7


To generate a sample with replacement (to allow repeat choices), pass replace=True to sample:

In [95]:
s1 = pd.Series(np.random.randn(5))

In [96]:
choices = s1.sample(n=10, replace=True)

In [97]:
choices

2   -0.529987
2   -0.529987
2   -0.529987
4    1.977558
2   -0.529987
1   -0.413098
2   -0.529987
4    1.977558
0   -2.192222
3    0.694295
dtype: float64
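Random results like these change on every run. If repeatable output is needed, one option is to seed the random source. The sketch below assumes NumPy's `default_rng` generator and the `random_state` argument of `sample`; the seed value 0 is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# Seeded generator: the same seed yields the same permutation on every run.
rng = np.random.default_rng(0)  # seed 0 is arbitrary
sampler = rng.permutation(len(df))
shuffled = df.take(sampler)

# sample(frac=1) also returns all rows in random order;
# random_state makes the draw repeatable.
shuffled_again = df.sample(frac=1, random_state=0)
same_draw = df.sample(frac=1, random_state=0)
```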

### Computing Indicator/Dummy Variables

Another type of transformation is converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a DataFrame has k distinct values, we would derive a matrix or DataFrame with k columns containing all 1s and 0s. pandas has a __get_dummies__ function for this task:

In [98]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

In [99]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


We can add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. __get_dummies__ has a prefix argument for doing this:

In [100]:
dummies = pd.get_dummies(df['key'], prefix='key')

In [101]:
df_with_dummy = df[['data1']].join(dummies)

In [102]:
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
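Depending on the pandas version, __get_dummies__ may return boolean columns (True/False) rather than the 0/1 integers shown above. A hedged way to get integers regardless of version is to pass the dtype argument explicitly, as in this sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# dtype=int requests 0/1 integers regardless of the library's default dtype.
dummies = pd.get_dummies(df['key'], prefix='key', dtype=int)

# Each row has exactly one indicator set, so the row sums are all 1.
row_sums = dummies.sum(axis=1)
```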


***

## 7.3 String Manipulation

### String Object Methods

Built-in string methods are sufficient in many string munging and scripting applications. For example, a comma-separated string can be broken into pieces with __split__:

In [103]:
val = 'hello, world, how, are, you ?'

In [104]:
val.split(',')

['hello', ' world', ' how', ' are', ' you ?']

The __split__ method is often combined with __strip__ to trim whitespace (including line breaks):

In [105]:
pieces = [x.strip() for x in val.split(',')]

In [106]:
pieces

['hello', 'world', 'how', 'are', 'you ?']

These substrings could be concatenated together with a double-colon delimiter using addition (+):

In [107]:
len(pieces)

5

In [108]:
a, b, c, d, e = pieces

In [109]:
a+'::'+b+'::'+c+'::'+d+'::'+e

'hello::world::how::are::you ?'

A more practical and idiomatic way to do this is to pass a list or tuple to the __join__ method on the string '::':

In [110]:
'::'.join(pieces)

'hello::world::how::are::you ?'

Further, we can use the __in__ keyword to detect a substring, though __index__ and __find__ can also be used:


In [111]:
'hello' in val

True

In [112]:
'bye' in val

False

In [113]:
val.index(',')

5

In [114]:
val.find(':')

-1

Note that the difference between __find__ and __index__ is that __index__ raises an exception if the string isn’t found (versus returning –1):

In [115]:
val.index(':')

ValueError: substring not found

Similarly, __count__ returns the number of occurrences of a particular substring:

In [116]:
val.count(',')

4

__replace__ will substitute occurrences of one pattern for another. It is commonly used to delete patterns, too, by passing an empty string:

In [117]:
val.replace(',', '+')

'hello+ world+ how+ are+ you ?'

In [118]:
val.replace(',', '')

'hello world how are you ?'

### Regular Expressions

Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python’s built-in re module is responsible for applying regular expressions to strings.

Let’s look at a simple example. Suppose we wanted to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is \s+:

In [119]:
import re

In [120]:
text = "hello none \t thing     can \tcause"

In [121]:
re.split(r'\s+', text)

['hello', 'none', 'thing', 'can', 'cause']

When we call re.split, the regular expression is first compiled, and then its __split__ method is called on the passed text. We can compile the regex ourselves with re.compile, forming a reusable regex object:

In [122]:
regex = re.compile(r'\s+')

In [123]:
regex.split(text)

['hello', 'none', 'thing', 'can', 'cause']

If, instead, we want a list of all patterns matching the regex, we can use the __findall__ method:

In [124]:
regex.findall(text)

[' ', ' \t ', '     ', ' \t']

__match__ and __search__ are closely related to __findall__. While __findall__ returns all matches in a string, __search__ returns only the first match. More rigidly, __match__ matches only at the beginning of the string.

For example, let’s consider a block of text and a regular expression capable of identifying most email addresses:

In [125]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

In [126]:
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

In [127]:
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

Using __findall__ on the text produces a list of the email addresses:

In [128]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

__search__ returns a special match object for the first email address in the text. For the preceding regex, the match object can only tell us the start and end position of the pattern in the string:

In [129]:
m = regex.search(text)

In [130]:
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [131]:
text[m.start():m.end()]

'dave@google.com'

__regex.match__ returns __None__, because it matches only if the pattern occurs at the start of the string:

In [132]:
print(regex.match(text))

None


Relatedly, __sub__ returns a new string in which each occurrence of the pattern is replaced by the given string:

In [133]:
print(regex.sub('--last name --', text))

Dave --last name --
Steve --last name --
Rob --last name --
Ryan --last name --



Suppose we wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:

In [134]:
pattern =  r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [135]:
regex = re.compile(pattern, flags=re.IGNORECASE)

A __match__ object produced by this modified regex returns a tuple of the pattern components with its groups method:

In [137]:
m = regex.match('something@nothing.too')

In [138]:
m.groups()

('something', 'nothing', 'too')

__findall__ returns a list of tuples when the pattern has groups:

In [139]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

The __sub__ method also has access to groups in each match using special symbols like \1 and \2. The symbol \1 corresponds to the first matched group, \2 corresponds to the second, and so forth:

In [140]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
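Numbered groups become harder to follow as patterns grow. As an extension beyond these notes, the standard re module also supports named groups, written `(?P<name>...)`. The sketch below rewrites the same email pattern with names; the names `username`, `domain`, and `suffix` are chosen here for illustration.

```python
import re

# Same three groups as above, but each is given a name with (?P<name>...).
named_pattern = (r'(?P<username>[A-Z0-9._%+-]+)'
                 r'@(?P<domain>[A-Z0-9.-]+)'
                 r'\.(?P<suffix>[A-Z]{2,4})')
named_regex = re.compile(named_pattern, flags=re.IGNORECASE)

m = named_regex.match('something@nothing.too')
parts = m.groupdict()  # dict mapping each group name to its matched text

# Named groups can be referenced in sub with \g<name>.
masked = named_regex.sub(r'\g<username>@***', 'Dave dave@google.com')
```

`groupdict` returns the components keyed by name, which tends to read better than positional tuples when there are more than two or three groups.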



### Vectorized String Functions in Pandas

Missing data in columns containing strings can complicate matters. For example:

In [141]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [142]:
data = pd.Series(data)

In [143]:
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [144]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

To handle this, Series has array-oriented methods for string operations that skip NA values. These are accessed through the Series str attribute; for example, we could check whether each email address has 'gmail' in it with __str.contains__:

In [145]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

Regular expressions can be used, too, along with any re options like IGNORECASE:

In [146]:
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [147]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object
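Two related tools can make these results easier to use downstream, sketched here on the same email data. Both go slightly beyond the notes: the `na` argument of `str.contains` controls what missing entries become, and `str.extract` spreads capture groups into columns instead of lists of tuples.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# na=False treats missing entries as "no match", giving a clean boolean mask.
is_gmail = data.str.contains('gmail', na=False)
gmail_users = data[is_gmail]

# extract returns one column per capture group; missing rows stay NaN.
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
parts = data.str.extract(pattern, flags=re.IGNORECASE)
```

Without na=False, the mask would contain NaN for 'Wes' and could not be used directly for boolean indexing.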