# Chapter 7: Data Cleaning and Preparation

A significant amount of the time spent on data analysis (often reported as 80%
or more) goes to data preparation: loading, cleaning, transforming, and
rearranging.

pandas provides a high-level, flexible, and fast set of tools for this purpose.

In [1]:
import pandas as pd
import numpy as np

## 7.1 Handling Missing Data

Missing data occurs commonly, and one of the goals of pandas is to make working
with missing data as painless as possible.

For numeric data, pandas uses the floating-point value NaN (Not a Number) to
represent missing data. The built-in Python value None is also treated as NA in
object arrays.

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [3]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [4]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

NA handling methods:
- dropna
- fillna
- isnull
- notnull
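
As a rough illustration of how the detection methods in this list are typically
combined, the sketch below counts missing values and fills them. The Series `s`
and the fill value "unknown" are made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", None, np.nan, "d"])  # both None and NaN count as missing

n_missing = s.isnull().sum()    # True counts as 1, so this counts missing values
n_present = s.notnull().sum()   # notnull is the complement of isnull
filled = s.fillna("unknown")    # replace every missing value with a constant

print(n_missing, n_present)     # 2 2
print(filled.tolist())          # ['a', 'unknown', 'unknown', 'd']
```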

### Filtering Out Missing Data



Calling dropna on a Series returns the Series with only the non-null data and
index values.

In [5]:
from numpy import nan as NA

In [6]:
data = pd.Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [7]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64
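
A small sketch, assuming the same five-element Series as above, suggesting that
dropna gives the same result as boolean indexing with notnull:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

via_dropna = data.dropna()
via_mask = data[data.notnull()]   # boolean indexing keeps the same rows and labels

print(via_dropna.equals(via_mask))   # True
print(via_mask.index.tolist())       # [0, 2, 4]
```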

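With a DataFrame, things are a little more complicated.

You may want to drop rows or columns that are all NA, or only those containing
any NA. By default, dropna drops any row containing a missing value; passing
how='all' drops only rows that are all NA, and passing axis=1 drops columns
instead.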

In [8]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [9]:
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [10]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [11]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


A related way to filter out DataFrame rows tends to concern time series data.

Using the thresh argument, you can keep only rows that contain at least a
certain number of non-null observations.

In [12]:
df = pd.DataFrame(np.random.randn(7,3))
df

Unnamed: 0,0,1,2
0,-1.096138,0.33702,0.463966
1,1.60246,-0.96472,-2.561998
2,-0.354215,-0.692127,-0.467213
3,-0.795339,-0.872072,0.659877
4,1.887236,1.003942,0.132719
5,0.219748,-0.188657,-1.000725
6,-0.210705,0.081416,-0.200455


In [13]:
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,-1.096138,,
1,1.60246,,
2,-0.354215,,-0.467213
3,-0.795339,,0.659877
4,1.887236,1.003942,0.132719
5,0.219748,-0.188657,-1.000725
6,-0.210705,0.081416,-0.200455


In [14]:
df.dropna()

Unnamed: 0,0,1,2
4,1.887236,1.003942,0.132719
5,0.219748,-0.188657,-1.000725
6,-0.210705,0.081416,-0.200455


In [15]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,-0.354215,,-0.467213
3,-0.795339,,0.659877
4,1.887236,1.003942,0.132719
5,0.219748,-0.188657,-1.000725
6,-0.210705,0.081416,-0.200455
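
To make the meaning of thresh concrete, here is a minimal sketch on a made-up
three-row frame (the column names `a`, `b`, `c` are illustrative only). It
counts non-null values per row and compares that count with what dropna keeps.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                   "b": [2.0, 5.0, np.nan],
                   "c": [3.0, 6.0, 9.0]})

non_null_per_row = df.notnull().sum(axis=1)   # number of observed values in each row
kept = df.dropna(thresh=2)                    # keep rows with at least 2 non-null values

print(non_null_per_row.tolist())  # [3, 2, 1]
print(kept.index.tolist())        # [0, 1]
```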


### Filling In Missing Data

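Rather than filtering out missing data (and potentially discarding other data
along with it), you can fill in the holes in several ways, including a constant
fill value, padding (forward or backward filling), and interpolation. Calling
fillna with a constant replaces missing values with that value.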

In [16]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-1.096138,0.0,0.0
1,1.60246,0.0,0.0
2,-0.354215,0.0,-0.467213
3,-0.795339,0.0,0.659877
4,1.887236,1.003942,0.132719
5,0.219748,-0.188657,-1.000725
6,-0.210705,0.081416,-0.200455


Calling fillna with a dict lets you use a different fill value for each column.

In [17]:
df.fillna({1:0.5, 2:0})

Unnamed: 0,0,1,2
0,-1.096138,0.5,0.0
1,1.60246,0.5,0.0
2,-0.354215,0.5,-0.467213
3,-0.795339,0.5,0.659877
4,1.887236,1.003942,0.132719
5,0.219748,-0.188657,-1.000725
6,-0.210705,0.081416,-0.200455


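The same interpolation methods available for reindexing, forward fill (ffill)
and backward fill (bfill), can be used with fillna. The limit argument caps the
number of consecutive values filled.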

In [18]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

Unnamed: 0,0,1,2
0,-0.479181,-0.4186,-0.477484
1,0.277217,0.370423,-1.846068
2,0.169345,,-0.96407
3,-0.672873,,-1.246081
4,0.189906,,
5,-0.070733,,


In [19]:
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,-0.479181,-0.4186,-0.477484
1,0.277217,0.370423,-1.846068
2,0.169345,0.370423,-0.96407
3,-0.672873,0.370423,-1.246081
4,0.189906,0.370423,-1.246081
5,-0.070733,0.370423,-1.246081


In [20]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,-0.479181,-0.4186,-0.477484
1,0.277217,0.370423,-1.846068
2,0.169345,0.370423,-0.96407
3,-0.672873,0.370423,-1.246081
4,0.189906,,-1.246081
5,-0.070733,,-1.246081


In [21]:
df.fillna(df.mean())

Unnamed: 0,0,1,2
0,-0.479181,-0.4186,-0.477484
1,0.277217,0.370423,-1.846068
2,0.169345,-0.024089,-0.96407
3,-0.672873,-0.024089,-1.246081
4,0.189906,-0.024089,-1.133426
5,-0.070733,-0.024089,-1.133426


fillna function arguments:
- value
- method
- axis
- inplace
- limit
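
The filling options can also be reached through the dedicated ffill and bfill
methods, which recent pandas releases prefer over the method argument of
fillna. The following is a minimal sketch on a made-up Series, not a complete
tour of the arguments listed here.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()          # propagate the last valid value forward
capped = s.ffill(limit=1)    # fill at most one consecutive NaN
backward = s.bfill()         # propagate the next valid value backward

print(forward.tolist())           # [1.0, 1.0, 1.0, 1.0, 5.0]
print(capped.isnull().tolist())   # [False, False, True, True, False]
print(backward.tolist())          # [1.0, 5.0, 5.0, 5.0, 5.0]
```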

## 7.2 Data Transformation

Filtering, cleaning, and other transformations are another class of important
data operations.

### Removing Duplicates

In [22]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [23]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [25]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


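Both of these methods consider all of the columns by default, but you can
alternatively specify any subset of columns to detect duplicates. They also
keep the first observed value combination by default; passing keep='last'
keeps the last one.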

In [26]:
data['v1'] = range(7)

In [27]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
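
A short sketch of the keep option, assuming the same seven-row frame as above
(without the v1 column). Rows 5 and 6 are the only exact duplicates, so the two
settings differ only in which of them survives.

```python
import pandas as pd

data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})

first = data.drop_duplicates(["k1", "k2"])               # default keep='first'
last = data.drop_duplicates(["k1", "k2"], keep="last")   # keep the final occurrence

print(first.index.tolist())  # [0, 1, 2, 3, 4, 5]
print(last.index.tolist())   # [0, 1, 2, 3, 4, 6]
```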


### Transforming Data Using a Function or Mapping

Many times, we want to perform some transformation based on the values in a
dataset, such as a Series or a DataFrame column.

In [33]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


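Suppose we want to add a column indicating the animal each food came from. We
start with a dict relating each meat to its animal of origin. Because some of
the meats are capitalized and others are not, we also convert each value to
lowercase using the str.lower Series method.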

In [34]:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

In [35]:
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [36]:
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


Above, the map method on a Series accepts a dict-like object containing a
mapping and looks up each element of lowercased in meat_to_animal.

We could also have passed a function that does all the work, taking a food
name and returning the animal:

In [37]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
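
One practical difference between the two styles is how unmapped values behave.
In the sketch below, 'tofu' is a deliberately unmapped, made-up entry: mapping
with a dict leaves it missing, while a function can supply a fallback (here via
dict.get with the arbitrary default "unknown").

```python
import pandas as pd

meat_to_animal = {"bacon": "pig", "pastrami": "cow"}
food = pd.Series(["Bacon", "pastrami", "tofu"])   # 'tofu' has no entry in the dict

by_dict = food.str.lower().map(meat_to_animal)    # unmapped values become NaN
by_func = food.map(lambda x: meat_to_animal.get(x.lower(), "unknown"))

print(by_dict.isnull().tolist())  # [False, False, True]
print(by_func.tolist())           # ['pig', 'cow', 'unknown']
```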

### Replacing Values

Filling in missing data with `fillna` is a special case of more general value
replacement. `map` can be used to modify a subset of values in an object, but
`replace` provides a simpler and more flexible way to do so.

In [38]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

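The -999 values might be sentinel values for missing data. To replace them with
NA values that pandas understands, use replace, which returns a new Series. You
can replace one value, several values at once, or use a different replacement
for each value by passing lists or a dict.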

In [39]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [40]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [41]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [42]:
data.replace({-999:np.nan, -1000:0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

Note: data.replace is distinct from data.str.replace, which performs string
substitution element-wise (replacing characters within each string).
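
The distinction can be seen on a small made-up Series of strings. This is only
a sketch; regex=False is passed explicitly so the behaviour does not depend on
the pandas version's default.

```python
import pandas as pd

s = pd.Series(["red", "dark red", "blue"])

whole = s.replace("red", "crimson")                      # matches entire values only
partial = s.str.replace("red", "crimson", regex=False)   # substitutes inside each string

print(whole.tolist())    # ['crimson', 'dark red', 'blue']
print(partial.tolist())  # ['crimson', 'dark crimson', 'blue']
```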

### Renaming Axis Indexes

Axis labels can be similarly transformed by a function or mapping.

In [43]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [44]:
transform = lambda x: x[:4].upper()
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [45]:
data.index = data.index.map(transform)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [46]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [47]:
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [49]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11
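
Without inplace=True, rename returns a modified copy and leaves the original
untouched. The following sketch uses a made-up two-row frame to show a function
and a dict being mixed in a single call.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=["ohio", "utah"], columns=["one", "two", "three"])

# A function for the index, a partial mapping for the columns
renamed = data.rename(index=str.upper, columns={"three": "THREE"})

print(renamed.index.tolist())    # ['OHIO', 'UTAH']
print(renamed.columns.tolist())  # ['one', 'two', 'THREE']
print(data.index.tolist())       # ['ohio', 'utah']
```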


### Discretization and Binning

Continuous data is often discretized or separated into 'bins' for analysis.

In [51]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special Categorical object.

The output you see describes the bins computed by `pandas.cut`.

You can treat it like an array of strings indicating the bin name; internally it
contains a categories array specifying the distinct category names, along with
a labeling for the ages data in the codes attribute.

In [52]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [53]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

In [54]:
pd.value_counts(cats)

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64

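Consistent with mathematical notation for intervals, a parenthesis means the
side is open and a square bracket means it is closed (inclusive). You can
change which side is closed by passing right=False, turning left-open,
right-closed intervals into left-closed, right-open ones.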

In [55]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
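
The effect of the closed side is easiest to see for values that sit exactly on
a bin edge. In this sketch the ages are chosen (hypothetically) to hit the
edges, and a code of -1 marks a value that falls in no bin.

```python
import pandas as pd

ages = [18, 25, 26, 100, 101]
bins = [18, 25, 35, 60, 100]

right_closed = pd.cut(ages, bins)               # intervals like (18, 25]
left_closed = pd.cut(ages, bins, right=False)   # intervals like [18, 25)

print(right_closed.codes.tolist())  # [-1, 0, 1, 3, -1]
print(left_closed.codes.tolist())   # [0, 1, 1, -1, -1]
```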

Finally, you can pass your own bin names as a list or array to the labels
option, which makes the result easier to interpret.

In [57]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

If you pass an integer number of bins to cut instead of bin edges, it will 
compute equal-length bins based on the min and max values in the data.

In [58]:
data = np.random.rand(20)
pd.cut(data, 4, precision=2)

[(0.11, 0.32], (0.11, 0.32], (0.32, 0.54], (0.11, 0.32], (0.11, 0.32], ..., (0.11, 0.32], (0.54, 0.75], (0.75, 0.97], (0.54, 0.75], (0.32, 0.54]]
Length: 20
Categories (4, interval[float64, right]): [(0.11, 0.32] < (0.32, 0.54] < (0.54, 0.75] < (0.75, 0.97]]

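The precision=2 option limits the decimal precision to two digits.

The closely related function `qcut` bins the data based on sample quantiles.
Depending on the distribution of the data, using cut will not usually result in
each bin having the same number of data points. Because qcut uses sample
quantiles instead, by definition you obtain roughly equal-size bins. You can
also pass your own quantiles (numbers between 0 and 1, inclusive).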

In [59]:
data = np.random.randn(1000)
cats = pd.qcut(data, 4)
cats

[(-2.598, -0.673], (-0.0171, 0.675], (0.675, 3.451], (-0.673, -0.0171], (0.675, 3.451], ..., (0.675, 3.451], (-2.598, -0.673], (-0.673, -0.0171], (0.675, 3.451], (-0.0171, 0.675]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.598, -0.673] < (-0.673, -0.0171] < (-0.0171, 0.675] < (0.675, 3.451]]

In [60]:
pd.value_counts(cats)

(-2.598, -0.673]     250
(-0.673, -0.0171]    250
(-0.0171, 0.675]     250
(0.675, 3.451]       250
dtype: int64

In [61]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(-2.598, -1.287], (-0.0171, 1.247], (-0.0171, 1.247], (-1.287, -0.0171], (-0.0171, 1.247], ..., (-0.0171, 1.247], (-2.598, -1.287], (-1.287, -0.0171], (1.247, 3.451], (-0.0171, 1.247]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.598, -1.287] < (-1.287, -0.0171] < (-0.0171, 1.247] < (1.247, 3.451]]
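
The contrast between cut and qcut shows up most clearly on skewed data. The
sketch below assumes an exponentially distributed sample with an arbitrary
seed; neither choice comes from the original example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)       # seed chosen arbitrarily for reproducibility
data = rng.exponential(size=1000)    # a skewed sample

# Equal-width bins versus equal-count (quantile) bins
equal_width = pd.Series(pd.cut(data, 4)).value_counts(sort=False)
equal_count = pd.Series(pd.qcut(data, 4)).value_counts(sort=False)

print(equal_width.tolist())   # uneven counts, most points land in the first bin
print(equal_count.tolist())   # [250, 250, 250, 250]
```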

### Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array 
operations.

In [66]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.011248,0.004856,-0.080359,0.00343
std,1.019157,1.006258,0.966304,1.005859
min,-2.994783,-3.371911,-3.419896,-3.879099
25%,-0.718733,-0.723029,-0.7834,-0.64801
50%,-0.022164,0.024514,-0.059646,0.008197
75%,0.73093,0.665393,0.572044,0.723137
max,3.499431,3.053661,3.050233,3.352008


In [67]:
data[2][np.abs(data[2]) > 3]

500   -3.419896
995    3.050233
Name: 2, dtype: float64

In [68]:
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
85,1.106449,-3.371911,1.242199,0.986362
189,3.340331,2.558076,-0.742,-0.128795
231,-0.314838,-0.950657,-1.478452,-3.012131
305,0.196903,-1.134775,-0.460778,3.006561
395,3.124329,-1.365134,-1.502519,0.875482
500,0.448256,-1.249248,-3.419896,-1.171032
645,0.420247,3.053661,-1.02474,-0.424078
787,0.454219,1.479456,-1.176144,-3.879099
818,-0.140829,1.349578,1.45554,-3.673265
831,1.11186,-3.158836,1.012837,-0.453553


In [69]:
data[np.abs(data) > 3] = np.sign(data) * 3

In [70]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.010284,0.005435,-0.079989,0.004636
std,1.016135,1.004115,0.96478,0.99942
min,-2.994783,-3.0,-3.0,-3.0
25%,-0.718733,-0.723029,-0.7834,-0.64801
50%,-0.022164,0.024514,-0.059646,0.008197
75%,0.73093,0.665393,0.572044,0.723137
max,3.0,3.0,3.0,3.0
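
The capping step can also be written with the DataFrame clip method. This is a
sketch under the assumption of freshly generated normal data (arbitrary seed),
comparing the masking approach used above with clip.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)   # arbitrary seed
data = pd.DataFrame(rng.standard_normal((1000, 4)))

capped = data.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3   # mask-and-assign, as above
clipped = data.clip(lower=-3, upper=3)             # built-in alternative

print(np.allclose(capped, clipped))                 # True
print(float(clipped.abs().max().max()) <= 3.0)      # True
```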


### Permutation and Random Sampling

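Permuting (randomly reordering) a Series or the rows in a DataFrame is easy
using `numpy.random.permutation`. Calling permutation with the length of the
axis you want to permute produces an array of integers indicating the new
ordering, which can then be used in iloc-based indexing or the equivalent take
function. To select a random subset without replacement, use the sample method.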

In [71]:
df = pd.DataFrame(np.arange(5*4).reshape((5,4)))
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [73]:
sampler = np.random.permutation(5)
sampler

array([3, 4, 1, 0, 2])

In [74]:
df.take(sampler)

Unnamed: 0,0,1,2,3
3,12,13,14,15
4,16,17,18,19
1,4,5,6,7
0,0,1,2,3
2,8,9,10,11


In [75]:
df.sample(n=3)

Unnamed: 0,0,1,2,3
0,0,1,2,3
4,16,17,18,19
3,12,13,14,15


To generate a sample with replacement (to allow repeat choices), pass
replace=True to sample.

In [76]:
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws

3    6
1    7
0    5
0    5
1    7
0    5
1    7
0    5
4    4
0    5
dtype: int64
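
Because sampling is random, results change from run to run. A minimal sketch,
assuming you want a reproducible shuffle: pass frac=1 to draw every row and
random_state to fix the seed (the value 0 is arbitrary).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

shuffled = df.sample(frac=1, random_state=0)   # all rows, in random order
again = df.sample(frac=1, random_state=0)      # same seed, same order

print(shuffled.equals(again))                    # True
print(sorted(shuffled.index) == list(df.index))  # True
```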

### Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning
applications is converting a categorical variable into an indicator matrix.

If a column in a DataFrame has k distinct values, you would derive a matrix or
DataFrame with k columns containing all 1s and 0s.

In [77]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [78]:
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0


If a row in a DataFrame belongs to multiple categories, things are a little
more complicated.

In [80]:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames, engine='python')
movies.head()


Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [83]:
all_genres = []
for x in movies['genres']:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

In [85]:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)

In [86]:
gen = movies.genres[0]
gen.split('|')

['Animation', "Children's", 'Comedy']

In [87]:
dummies.columns.get_indexer(gen.split('|'))

array([0, 1, 2])

In [88]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

In [89]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]

movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                              1.0
Genre_Children's                             1.0
Genre_Comedy                                 1.0
Genre_Adventure                              0.0
Genre_Fantasy                                0.0
Genre_Romance                                0.0
Genre_Drama                                  0.0
Genre_Action                                 0.0
Genre_Crime                                  0.0
Genre_Thriller                               0.0
Genre_Horror                                 0.0
Genre_Sci-Fi                                 0.0
Genre_Documentary                            0.0
Genre_War                                    0.0
Genre_Musical                                0.0
Genre_Mystery                                0.0
Genre_Film-Noir                              0.0
Genre_Western                                0.0
Name: 0, dtype: object
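
For delimiter-separated categories like these, the str.get_dummies method
offers a shorter route than building the zero matrix by hand. The sketch below
runs on a hypothetical two-row miniature of the movies table rather than the
real file.

```python
import pandas as pd

# Hypothetical miniature of the movies table used above
movies = pd.DataFrame({"title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
                       "genres": ["Animation|Children's|Comedy", "Comedy|Romance"]})

dummies = movies["genres"].str.get_dummies(sep="|")   # one indicator column per genre
with_indicators = movies.join(dummies.add_prefix("Genre_"))

print(dummies.columns.tolist())
# ['Animation', "Children's", 'Comedy', 'Romance']
print(int(with_indicators.loc[1, "Genre_Romance"]))   # 1
```

Note that the columns come out sorted alphabetically, unlike the
order-of-appearance columns produced by the loop above.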

In [90]:
np.random.seed(12345)
values = np.random.rand(10)
values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

In [91]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,0,0,1
1,0,1,0,0,0
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
5,0,0,1,0,0
6,0,0,0,0,1
7,0,0,0,1,0
8,0,0,0,1,0
9,0,0,0,1,0


## 7.3 String Manipulation

pandas adds to Python's built-in string processing capabilities by enabling you
to apply string methods and regular expressions concisely on whole arrays of
data, additionally handling the annoyance of missing data.

### String Object Methods

In most string munging and scripting applications, the built-in string object
methods are sufficient. Commonly used methods include:
- count
- endswith
- startswith
- join
- index
- find
- rfind
- replace
- strip
- rstrip
- lstrip
- split
- lower
- upper
- casefold
- ljust
- rjust

### Regular Expressions

Regular Expressions provide a flexible way to search or match string patterns
in text.

The `re` module functions fall into three categories: pattern matching,
substitution, and splitting.

In [92]:
import re

In [94]:
text = "foo    bar\t baz  \tqux"
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

If you want to reuse a regex multiple times, you should compile it first with
re.compile, creating a reusable regex object.

In [95]:
regex = re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [96]:
regex.findall(text)

['    ', '\t ', '  \t']

match and search are closely related to findall. While findall returns all 
matches in a string, search returns only the first match. More rigidly, match
only matches at the beginning of the string.

In [99]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

In [100]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [102]:
m = regex.search(text)
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [103]:
text[m.start():m.end()]

'dave@google.com'

In [105]:
print(regex.match(text))

None


Relatedly, sub returns a new string with occurrences of the pattern replaced by
a new string.

In [106]:
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



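Suppose you wanted to find email addresses and segment them into their three
components: username, domain name, and domain suffix.

To do this, put parentheses around the parts of the pattern to segment. A match
object produced by the modified regex returns a tuple of the pattern components
with its groups method, and findall returns a list of tuples when the pattern
has groups.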

In [107]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

In [108]:
m = regex.match('wesm@bright.net')

In [109]:
m.groups()

('wesm', 'bright', 'net')

In [110]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
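
Captured groups can also be referenced in the replacement string passed to sub,
using \1, \2, and so on. A minimal sketch reusing the grouped pattern above on
a single made-up line:

```python
import re

pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)

line = "Dave dave@google.com"
# \1, \2, \3 refer to the username, domain name, and suffix groups
rewritten = regex.sub(r"Username: \1, Domain: \2, Suffix: \3", line)

print(rewritten)
# Dave Username: dave, Domain: google, Suffix: com
```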

Regular expression methods:
- findall
- finditer
- match
- search
- split
- sub
- subn
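
Two of the methods in this list, finditer and subn, are not demonstrated above.
The sketch below applies them to a shortened, two-line version of the email
text; the variable names are illustrative.

```python
import re

text = "Dave dave@google.com\nSteve steve@gmail.com"
regex = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", flags=re.IGNORECASE)

# finditer yields match objects lazily instead of building a list of strings
spans = [(m.start(), m.end()) for m in regex.finditer(text)]

# subn works like sub but also reports how many replacements were made
redacted, n_subs = regex.subn("REDACTED", text)

print(spans)    # [(5, 20), (27, 42)]
print(n_subs)   # 2
```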

### Vectorized String Functions in pandas

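Series has array-oriented methods for string operations that skip NA values.
These are accessed through the Series's str attribute, as in `data.str.METHOD`.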

Partial listing of vectorized string methods:
- cat
- contains
- count
- extract
- endswith
- startswith
- findall
- get
- isalnum
- isalpha
- isdecimal
- isdigit
- islower
- isnumeric
- isupper
- join
- len
- lower
- upper
- match
- pad
- center
- repeat
- replace
- slice
- split
- strip
- rstrip
- lstrip
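
As a brief sketch of a few of these methods in use, the example below builds a
made-up Series of email addresses with one missing entry. The point is that the
str methods operate element-wise and do not raise on the missing value; how
that missing value is reported can vary between pandas versions.

```python
import numpy as np
import pandas as pd

emails = pd.Series({"Dave": "dave@google.com", "Steve": "steve@gmail.com",
                    "Wes": np.nan})

has_gmail = emails.str.contains("gmail")   # element-wise substring test
# Capture the text between '@' and the last dot; expand=False returns a Series
domains = emails.str.extract(r"@([A-Za-z0-9.-]+)\.", expand=False)
lengths = emails.str.len()                 # missing entries stay missing

print(bool(has_gmail["Steve"]), bool(has_gmail["Dave"]))  # True False
print(domains["Dave"], domains["Steve"])                  # google gmail
print(pd.isnull(lengths["Wes"]))                          # True
```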