## 1 Handling Missing Data

In [85]:
import numpy as np

In [86]:
import pandas as pd

The built-in Python `None` value is also treated as NA in object arrays.
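A minimal sketch of that behavior, using a small made-up Series (the name `string_data` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical data: strings mixed with both np.nan and None
string_data = pd.Series(['aardvark', 'artichoke', np.nan, None])

# isnull() flags np.nan and None alike as missing
mask = string_data.isnull()
print(mask.tolist())
```

Both of the last two entries should be reported as missing, regardless of which sentinel was used.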

 <img src='img/7_1_1.png'>

### 1.1 Filtering Out Missing Data

Use the `dropna` method to filter missing data out of a Series:

In [3]:
from numpy import nan as NA

In [4]:
data = pd.Series([1, NA, 3.5, NA, 7])

In [5]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

It's equivalent to:

In [7]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, you can drop rows or columns based on how many NA values they contain.

In [8]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                   [NA, NA, NA], [NA, 6.5, 3.]])

By default, `dropna` drops any row containing an NA value (`how='any'`):

In [9]:
cleaned = data.dropna()

In [10]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [11]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Drop rows that are all NA:

In [12]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


Drop columns that are all NA by passing `axis=1`:

In [13]:
data[4] = NA

In [14]:
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [16]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


Use the `thresh` argument to keep only rows containing at least a given number of non-NA values:

In [17]:
df = pd.DataFrame(np.random.randn(7, 3))

In [18]:
df.iloc[:4, 1] = NA

In [19]:
df.iloc[:2, 2] = NA

In [20]:
df

Unnamed: 0,0,1,2
0,0.135889,,
1,1.621497,,
2,-1.03463,,0.388766
3,1.066616,,-1.146681
4,-0.715149,-1.541511,1.531231
5,-0.452483,-1.696787,1.457432
6,0.956908,-0.193155,-0.353949


In [21]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,-1.03463,,0.388766
3,1.066616,,-1.146681
4,-0.715149,-1.541511,1.531231
5,-0.452483,-1.696787,1.457432
6,0.956908,-0.193155,-0.353949
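Because `thresh` counts non-NA values (not NA values), it can help to check the count per row. The sketch below uses a small made-up frame (`frame` is a hypothetical name) whose rows hold 3, 2, 1, and 0 non-NA values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 0..3 contain 3, 2, 1 and 0 non-NA values
frame = pd.DataFrame([[1.0, 2.0, 3.0],
                      [1.0, np.nan, 3.0],
                      [np.nan, np.nan, 3.0],
                      [np.nan, np.nan, np.nan]])

# Number of non-NA values in each row
non_na_per_row = frame.notnull().sum(axis=1)

# Keep only rows with at least 2 non-NA values
kept = frame.dropna(thresh=2)

print(non_na_per_row.tolist())
print(kept.index.tolist())
```

Only the first two rows should survive, matching the rows where the non-NA count is 2 or more.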


### 1.2 Filling In Missing Data

In [22]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.135889,0.0,0.0
1,1.621497,0.0,0.0
2,-1.03463,0.0,0.388766
3,1.066616,0.0,-1.146681
4,-0.715149,-1.541511,1.531231
5,-0.452483,-1.696787,1.457432
6,0.956908,-0.193155,-0.353949


Specify a different fill value for each column by passing a dict:

In [23]:
df.fillna({1:0.4, 2:0})

Unnamed: 0,0,1,2
0,0.135889,0.4,0.0
1,1.621497,0.4,0.0
2,-1.03463,0.4,0.388766
3,1.066616,0.4,-1.146681
4,-0.715149,-1.541511,1.531231
5,-0.452483,-1.696787,1.457432
6,0.956908,-0.193155,-0.353949


`fillna` returns a new object by default, but it can modify the existing object in place with `inplace=True`:

In [24]:
df.fillna(0, inplace=True)

In [25]:
df

Unnamed: 0,0,1,2
0,0.135889,0.0,0.0
1,1.621497,0.0,0.0
2,-1.03463,0.0,0.388766
3,1.066616,0.0,-1.146681
4,-0.715149,-1.541511,1.531231
5,-0.452483,-1.696787,1.457432
6,0.956908,-0.193155,-0.353949


The same interpolation methods available for reindexing can be used with `fillna` as well

In [26]:
df = pd.DataFrame(np.random.randn(6, 3))

In [28]:
df.iloc[2:, 1] = NA

In [33]:
df.iloc[4:, 2] = NA

In [37]:
df

Unnamed: 0,0,1,2
0,0.420402,1.313372,0.234094
1,-0.24041,1.13984,-0.367249
2,-0.585492,,-0.012703
3,0.544509,,1.452135
4,-0.345364,,
5,0.378364,,


In [38]:
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,0.420402,1.313372,0.234094
1,-0.24041,1.13984,-0.367249
2,-0.585492,1.13984,-0.012703
3,0.544509,1.13984,1.452135
4,-0.345364,1.13984,1.452135
5,0.378364,1.13984,1.452135


In [39]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,0.420402,1.313372,0.234094
1,-0.24041,1.13984,-0.367249
2,-0.585492,1.13984,-0.012703
3,0.544509,1.13984,1.452135
4,-0.345364,,1.452135
5,0.378364,,1.452135
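As far as I know, newer pandas releases deprecate the `method` argument of `fillna` in favor of the dedicated `ffill` and `bfill` methods. A minimal sketch of the same forward fill on a made-up Series (the name `s` and its values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a run of three consecutive NA values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill every gap with the last valid observation
filled_all = s.ffill()

# Forward fill at most 2 consecutive NA values; the third stays NA
filled_limited = s.ffill(limit=2)

print(filled_all.tolist())
print(filled_limited.tolist())
```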


## 2 Data Transformation

### 2.1 Removing Duplicates

In [40]:
data = pd.DataFrame({'k1':['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

In [41]:
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The DataFrame method `duplicated` returns a boolean Series indicating whether each row is a duplicate or not

In [42]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

`drop_duplicates` returns a DataFrame containing the rows where the `duplicated` array is False

In [43]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


By default, both methods consider all of the columns. You can also specify any subset of them to detect duplicates.

In [44]:
data['v1'] = range(7)

In [45]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


By default, both methods keep the first observed value combination. Passing `keep='last'` keeps the last one instead

In [46]:
data.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
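The effect of `keep` is easiest to see on the boolean flags themselves. This sketch rebuilds the two-column frame from above under the hypothetical name `dup_data` and also shows `keep=False`, which flags every member of a duplicated group:

```python
import pandas as pd

dup_data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                         'k2': [1, 1, 2, 3, 3, 4, 4]})

# Default keep='first': later repeats are flagged
first_flags = dup_data.duplicated()

# keep='last': earlier repeats are flagged instead
last_flags = dup_data.duplicated(keep='last')

# keep=False: every row that has a duplicate is flagged
all_flags = dup_data.duplicated(keep=False)

print(first_flags.tolist())
print(last_flags.tolist())
print(all_flags.tolist())
```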


### 2.2 Transforming Data Using a Function or Mapping

In [3]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
                              'corned beef', 'Bacon', 'pastrami', 'honey ham',
                              'nova lox']})

In [4]:
data

Unnamed: 0,food
0,bacon
1,pulled pork
2,bacon
3,Pastrami
4,corned beef
5,Bacon
6,pastrami
7,honey ham
8,nova lox


Add a column indicating the type of animal that each food came from

In [5]:
meat_to_animal = {
      'bacon': 'pig',
      'pulled pork': 'pig',
      'pastrami': 'cow',
      'corned beef': 'cow',
      'honey ham': 'pig',
      'nova lox': 'salmon'
}

Some of the foods are capitalized and others are not, so use the `str.lower` Series method to convert each value to lowercase first

In [8]:
lowercase = data['food'].str.lower()

In [9]:
# The map method on a Series accepts a function 
# or dict-like object containing a mapping
data['animal'] = lowercase.map(meat_to_animal)

In [10]:
data

Unnamed: 0,food,animal
0,bacon,pig
1,pulled pork,pig
2,bacon,pig
3,Pastrami,cow
4,corned beef,cow
5,Bacon,pig
6,pastrami,cow
7,honey ham,pig
8,nova lox,salmon


Alternatively, pass a function that does all the work

In [12]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
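The two approaches differ when a value has no entry in the mapping. The sketch below assumes a shortened mapping and a made-up unmapped food (`'tofu'`); the dict version yields NA, while the indexing lambda raises `KeyError`:

```python
import pandas as pd

# Shortened mapping and hypothetical data; 'tofu' has no entry
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}
foods = pd.Series(['Bacon', 'pastrami', 'tofu'])

# Mapping with a dict: unmapped values become NA
by_dict = foods.str.lower().map(meat_to_animal)

# Mapping with a function that uses dict.get: unmapped values become None
by_get = foods.map(lambda x: meat_to_animal.get(x.lower()))

# Mapping with a function that indexes the dict: unmapped values raise
try:
    foods.map(lambda x: meat_to_animal[x.lower()])
    raised = False
except KeyError:
    raised = True

print(by_dict.isnull().tolist(), raised)
```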

### 2.3 Replacing Values

In [13]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [14]:
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

Replace -999 with NA

In [15]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

Replace multiple values

In [17]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

Using a different replacement for each value

In [18]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [20]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

### 2.4 Renaming Axis Indexes

In [30]:
data = pd.DataFrame(np.arange(12).reshape((3, 4,)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

Axis indexes have a `map` method

In [31]:
data.index.map(lambda x: x[:4].upper())

array(['OHIO', 'COLO', 'NEW '], dtype=object)

In [32]:
data.index = data.index.map(lambda x: x[:4].upper())

In [33]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


Use the `rename` method to create a transformed version of a dataset without modifying the original

In [35]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [36]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [37]:
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11
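A short sketch confirming that `rename` without `inplace=True` leaves the original untouched. The frame is rebuilt under the hypothetical name `frame`, and the new label `'peekaboo'` is purely illustrative:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((3, 4)),
                     index=['Ohio', 'Colorado', 'New York'],
                     columns=['one', 'two', 'three', 'four'])

# Dict-like arguments rename only the listed labels
renamed = frame.rename(index={'Ohio': 'INDIANA'},
                       columns={'three': 'peekaboo'})

print(list(renamed.index), list(renamed.columns))
print(list(frame.index), list(frame.columns))  # original labels unchanged
```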


### 2.5 Discretization and Binning

Grouping people into discrete age buckets

In [39]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Use the pandas `cut` function to divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older.

In [40]:
bins = [18, 25, 35, 60, 100]

In [41]:
cats = pd.cut(ages, bins)

In [42]:
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object returned is a `Categorical` object.

In [43]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [44]:
cats.categories

Index([u'(18, 25]', u'(25, 35]', u'(35, 60]', u'(60, 100]'], dtype='object')

In [45]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

You can also pass your own bin names by passing a list or array to the `labels` option

In [46]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [47]:
pd.cut(ages, bins, labels=group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
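In the interval notation above, a parenthesis marks an open side and a square bracket a closed side. The sketch below shows how the `right=False` option of `cut` changes which side is closed; age 25 (position 2) moves from the first bin to the second:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Default: intervals closed on the right, e.g. (18, 25]
right_closed = pd.cut(ages, bins)

# right=False: intervals closed on the left, e.g. [18, 25)
left_closed = pd.cut(ages, bins, right=False)

# Bin code assigned to age 25 under each convention
print(right_closed.codes[2], left_closed.codes[2])
```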

If you pass an integer number of bins to `cut` instead of explicit bin edges, it computes equal-length bins based on the minimum and maximum values in the data.

In [48]:
data = np.random.rand(20)

In [50]:
pd.cut(data, 4, precision=2)

[(0.49, 0.72], (0.49, 0.72], (0.72, 0.95], (0.25, 0.49], (0.72, 0.95], ..., (0.25, 0.49], (0.25, 0.49], (0.021, 0.25], (0.25, 0.49], (0.72, 0.95]]
Length: 20
Categories (4, object): [(0.021, 0.25] < (0.25, 0.49] < (0.49, 0.72] < (0.72, 0.95]]

`qcut` bins the data based on sample quantiles. 

In [51]:
data = np.random.randn(1000)

In [52]:
cats = pd.qcut(data, 4) # Cut into 4 quantiles

In [53]:
cats

[[-3.451, -0.675], (0.625, 4.163], (0.625, 4.163], (-0.0314, 0.625], [-3.451, -0.675], ..., (-0.0314, 0.625], (-0.0314, 0.625], (-0.675, -0.0314], (-0.0314, 0.625], [-3.451, -0.675]]
Length: 1000
Categories (4, object): [[-3.451, -0.675] < (-0.675, -0.0314] < (-0.0314, 0.625] < (0.625, 4.163]]

In [54]:
pd.value_counts(cats)

(0.625, 4.163]       250
(-0.0314, 0.625]     250
(-0.675, -0.0314]    250
[-3.451, -0.675]     250
dtype: int64

Pass your own quantiles (numbers between 0 and 1, inclusive)

In [55]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[[-3.451, -1.33], (1.279, 4.163], (1.279, 4.163], (-0.0314, 1.279], [-3.451, -1.33], ..., (-0.0314, 1.279], (-0.0314, 1.279], (-1.33, -0.0314], (-0.0314, 1.279], [-3.451, -1.33]]
Length: 1000
Categories (4, object): [[-3.451, -1.33] < (-1.33, -0.0314] < (-0.0314, 1.279] < (1.279, 4.163]]
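A reproducible sketch of the bin sizes that `qcut` produces. The seed and the names `rng`, `values`, `quartile_counts`, and `custom_counts` are assumptions for illustration; it uses the Series `value_counts` method rather than the top-level function:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable
values = rng.randn(1000)

# Four quantile bins: roughly equal-size groups
quartile_counts = pd.Series(pd.qcut(values, 4)).value_counts()

# Custom quantiles: 10% / 40% / 40% / 10% of the data
custom_counts = pd.Series(
    pd.qcut(values, [0, 0.1, 0.5, 0.9, 1.])).value_counts()

print(sorted(quartile_counts.tolist()))
print(sorted(custom_counts.tolist()))
```

Unlike `cut`, the bin edges here depend on the data distribution, so the bins differ in width but not in size.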

### 2.6 Detecting and Filtering Outliers

In [56]:
data = pd.DataFrame(np.random.randn(1000, 4))

In [57]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.001043,-0.014349,-0.053024,-0.022516
std,0.998291,0.982212,1.018303,1.018277
min,-3.237502,-3.056962,-4.197482,-3.365287
25%,-0.660018,-0.721705,-0.720531,-0.714
50%,-0.009417,0.000915,-0.02657,-0.038265
75%,0.663598,0.659128,0.634119,0.665024
max,3.812908,2.946547,3.299196,3.623016


Select all rows having at least one value exceeding 3 in absolute value

In [65]:
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
50,-0.810155,-1.30783,-0.299653,-3.365287
187,0.415244,0.075217,3.254453,-1.376358
322,-3.009347,0.91445,0.825777,1.023562
522,1.125294,1.021758,-3.040628,1.645683
538,-0.102464,-0.749367,-4.197482,0.442056
645,-3.128886,1.919024,0.378674,-2.068048
667,0.133831,0.130382,-0.907492,3.623016
758,-0.876629,-0.520596,-3.078611,-1.121614
842,-3.237502,0.214094,-2.152631,-0.626796
879,0.770937,0.548447,3.299196,1.790977


Cap values outside the interval -3 to 3

In [66]:
# np.sign(data) returns 1 or -1 depending on the sign of each value (0 for zero)
data[np.abs(data) > 3] = np.sign(data) * 3

In [67]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.001753,-0.014268,-0.052261,-0.022773
std,0.993476,0.981963,1.012022,1.015095
min,-3.0,-3.0,-3.0,-3.0
25%,-0.660018,-0.721705,-0.720531,-0.714
50%,-0.009417,0.000915,-0.02657,-0.038265
75%,0.663598,0.659128,0.634119,0.665024
max,3.0,2.946547,3.0,3.0
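The same capping can also be written with the `clip` method. This sketch, on seeded random data with hypothetical names, compares the two approaches:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(12345)  # fixed seed for repeatability
frame = pd.DataFrame(rng.randn(1000, 4))

# Sign-based capping, as in the cell above
capped_by_sign = frame.copy()
capped_by_sign[np.abs(frame) > 3] = np.sign(frame) * 3

# Equivalent capping with clip
capped_by_clip = frame.clip(lower=-3, upper=3)

print(np.allclose(capped_by_sign.values, capped_by_clip.values))
```

`clip` returns a new object and leaves `frame` unmodified, which avoids the explicit copy.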


### 2.7 Permutation and Random Sampling

In [68]:
df = pd.DataFrame(np.arange(20).reshape((5, 4)))

Calling `np.random.permutation` with the length of the axis produces an array of integers indicating the new ordering

In [69]:
sampler = np.random.permutation(5)

In [70]:
sampler

array([2, 1, 0, 3, 4])

In [71]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [79]:
df.iloc[sampler]

Unnamed: 0,0,1,2,3
2,8,9,10,11
1,4,5,6,7
0,0,1,2,3
3,12,13,14,15
4,16,17,18,19


Or using the equivalent `take` function

In [85]:
df.take(sampler)

Unnamed: 0,0,1,2,3
2,8,9,10,11
1,4,5,6,7
0,0,1,2,3
3,12,13,14,15
4,16,17,18,19


Using the `sample` method on Series and DataFrame to select a random subset without replacement

In [88]:
df.sample(n=3)

Unnamed: 0,0,1,2,3
4,16,17,18,19
1,4,5,6,7
2,8,9,10,11


Generating a sample with replacement

In [90]:
choice = pd.Series([5, 7, -1, 6, 4])

In [91]:
draws = choice.sample(n=10, replace=True)

In [92]:
draws

0    5
1    7
0    5
1    7
3    6
4    4
3    6
3    6
4    4
2   -1
dtype: int64
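Since `sample` draws randomly, the outputs above change from run to run. A sketch of making the draws repeatable with the `random_state` argument (the seed value 42 is arbitrary):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(20).reshape((5, 4)))

# Same seed, same rows: sampling without replacement
first = frame.sample(n=3, random_state=42)
second = frame.sample(n=3, random_state=42)

# Sampling with replacement allows n larger than the Series length
choice = pd.Series([5, 7, -1, 6, 4])
draws = choice.sample(n=10, replace=True, random_state=42)

print(first.index.tolist() == second.index.tolist())
print(len(draws))
```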

### 2.8 Computing Indicator/Dummy Variables

Using `pd.get_dummies` function to convert a categorical variable into a "dummy" or "indicator" matrix

In [93]:
df = pd.DataFrame({'key':['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

In [94]:
pd.get_dummies(df.key)

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


Adding a prefix to the columns

In [100]:
dummies = pd.get_dummies(df.key, prefix='key')

In [101]:
dummies

Unnamed: 0,key_a,key_b,key_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [102]:
df_with_dummy = df[['data1']].join(dummies)

In [103]:
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0


If a row in a DataFrame belongs to multiple categories, building the indicator matrix takes a few more steps. The MovieLens movies file is one example:

In [104]:
!cat datasets/movielens/movies.dat | head -n 10

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller
cat: stdout: Broken pipe


In [105]:
mnames = ['move_id', 'title', 'genres']

In [111]:
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                      header=None, names=mnames, engine='python')

In [112]:
movies.head()

Unnamed: 0,move_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


Extract a list of all genres

In [113]:
all_genres = list()

In [114]:
for g in movies.genres:
    all_genres.extend(g.split('|'))

In [116]:
genres = pd.unique(all_genres)

In [117]:
genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

Creating a DataFrame of all zeros

In [128]:
zero_matrix = np.zeros((len(movies), len(genres)))

In [129]:
dummies = pd.DataFrame(zero_matrix, columns=genres)

Iterate through each movie and set the matching entries in each row of `dummies` to 1. Use `dummies.columns.get_indexer` to compute the column indices for each genre

In [130]:
dummies.columns.get_indexer(movies.genres[0].split('|'))

array([0, 1, 2])

In [131]:
for i, g in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(g.split('|'))
    dummies.iloc[i, indices] = 1

Combine the dummies with the movies

In [133]:
movie_windic = movies.join(dummies.add_prefix('Genre_'))

In [134]:
movie_windic.iloc[0]

move_id                                        1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
Genre_Action                                   0
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western       
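For delimiter-separated categories like these, the `str.get_dummies` Series method offers a shorter route than the explicit loop. The sketch below uses three rows copied from the listing above rather than the full file, so the frame is only a stand-in for the real `movies` data:

```python
import pandas as pd

# Stand-in for the MovieLens data: three rows from the listing above
movies_small = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Heat (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Action|Crime|Thriller'],
})

# Split each string on '|' and build one indicator column per genre
genre_dummies = movies_small['genres'].str.get_dummies('|')

movies_small_windic = movies_small.join(genre_dummies.add_prefix('Genre_'))
print(movies_small_windic.iloc[0])
```

Note that the resulting columns come out in sorted order rather than in order of first appearance.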

Combine `get_dummies` with a discretization function like `cut`

In [135]:
np.random.seed(12345)

In [136]:
values = np.random.rand(10)

In [137]:
values

array([ 0.92961609,  0.31637555,  0.18391881,  0.20456028,  0.56772503,
        0.5955447 ,  0.96451452,  0.6531771 ,  0.74890664,  0.65356987])

In [143]:
bins = np.linspace(0, 1, 6, endpoint=True)

In [144]:
bins

array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])

In [145]:
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1]"
0,0,0,0,0,1
1,0,1,0,0,0
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
5,0,0,1,0,0
6,0,0,0,0,1
7,0,0,0,1,0
8,0,0,0,1,0
9,0,0,0,1,0


## 3 String Manipulation

### 3.1 String Object Methods

Break a string into pieces with `split`

In [1]:
val = 'a,b,  guido'

In [2]:
val.split(',')

['a', 'b', '  guido']

`split` is often combined with `strip` to trim whitespace (including line breaks)

In [3]:
pieces = [x.strip() for x in val.split(',')]

In [4]:
pieces

['a', 'b', 'guido']

Using `join` method to concatenate strings

In [6]:
'::'.join(pieces)

'a::b::guido'

Use the `in` keyword or the `index` and `find` methods to locate substrings. `index` raises an exception if the substring isn't found, whereas `find` returns -1

In [7]:
'guido' in val

True

In [8]:
val.find(':')

-1

In [9]:
val.index(':')

ValueError: substring not found

`count` returns the number of non-overlapping occurrences of a particular substring

In [10]:
val.count(',')

2

In [13]:
'aaaa'.count('aa')

2

`replace` will substitute occurrences of one pattern for another.

In [11]:
val.replace(',', '::')

'a::b::  guido'

In [12]:
val.replace(',', '')

'ab  guido'

<img src='img/7_3_1.png'>

### 3.2 Regular Expressions

The `re` module functions fall into three categories: pattern matching, substitution, and splitting.

In [18]:
import re

Splitting a string with a variable number of whitespace characters.

In [20]:
text = "foo bar\t baz \tqux"

In [21]:
re.split(' +', text)

['foo', 'bar\t', 'baz', '\tqux']

In [22]:
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

The regular expression is first compiled when `re.split` is called. You can compile the regex yourself with `re.compile`, forming a reusable regex object. This is highly recommended if you intend to apply the same expression to many strings, since it saves CPU cycles

In [24]:
regex = re.compile(r'\s+')

In [25]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']

Use `findall` to get a list of all substrings matching the regex

In [26]:
regex.findall(text)

[' ', '\t ', ' \t']

`match` and `search` are closely related to `findall`. While `findall` returns all matches in a string, `search` returns only the first match, and `match` only matches at the beginning of the string

In [59]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

To avoid unwanted escaping with `\` in a regular expression, use raw string literals by prefixing the string with `r`, for example `r'\t'` instead of the equivalent `'\\t'`

In [36]:
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

In [37]:
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [38]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

`search` returns a special match object for the first matching pattern in the text.

In [41]:
m = regex.search(text)

In [42]:
m

<_sre.SRE_Match at 0x108353648>

The match object can tell us the start and end position of the pattern in the string

In [64]:
m.start(), m.end(), m.span(), m.group()

(5, 20, (5, 20), 'dave@google.com')

In [53]:
text[m.start():m.end()]

'dave@google.com'

`regex.match` returns `None` here, because it only matches if the pattern occurs at the start of the string. When the pattern does occur at the start, it returns a match object

In [57]:
print(regex.match(text))

None
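A compact sketch contrasting `match` and `search` on a shortened copy of the text. The second input string, `'dave@google.com is first'`, is made up to show a successful `match`:

```python
import re

short_text = "Dave dave@google.com\nSteve steve@gmail.com\n"
email_re = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}',
                      flags=re.IGNORECASE)

# match: the text starts with "Dave ", not an address, so no match
at_start = email_re.match(short_text)

# search: scans the whole string and returns the first occurrence
found = email_re.search(short_text)

# match succeeds when the address sits at the very start
anchored = email_re.match('dave@google.com is first')

print(at_start, found.span(), anchored.group())
```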


`sub` will return a new string with occurrences of the pattern replaced by a new string

In [60]:
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



Find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment

In [72]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [77]:
regex = re.compile(pattern, flags=re.IGNORECASE)

A match object produced by this modified regex returns a tuple of the pattern components with its `groups` method

In [78]:
m = regex.match('wesm@bright.net')

In [79]:
m.groups()

('wesm', 'bright', 'net')

In [80]:
m.group()

'wesm@bright.net'

`findall` returns a list of tuples when the pattern has groups

In [81]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

`sub` also has access to groups in each match using special symbols like \1 and \2.

In [83]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
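Groups can also be given names with the `(?P<name>...)` syntax, which makes the pieces available by name through `groupdict`. This is a sketch of the same pattern written with named groups; the group names are chosen here for illustration, and `re.VERBOSE` is used only to allow the multi-line layout:

```python
import re

named_re = re.compile(r'''
    (?P<username>[A-Z0-9._%+-]+)
    @
    (?P<domain>[A-Z0-9.-]+)
    \.
    (?P<suffix>[A-Z]{2,4})''', flags=re.IGNORECASE | re.VERBOSE)

m = named_re.match('wesm@bright.net')

# Dict of group name -> matched text
parts = m.groupdict()

# Named groups can be referenced in sub with \g<name>
rewritten = named_re.sub(r'\g<username> at \g<domain>', 'wesm@bright.net')

print(parts)
print(rewritten)
```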



<img src='img/7_3_2.png'>

### 3.3 Vectorized String Functions in pandas

In [87]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [88]:
data = pd.Series(data)

In [89]:
data

Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

String and regular expression methods can be applied to each value using `data.map` (passing a lambda or other function), but this fails on the NA values.

Series has array-oriented methods for string operations that skip NA values. They are accessed through the Series's `str` attribute

In [92]:
data.str.contains('gmail')

Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object
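A small sketch of the difference, on a shortened version of the Series under the hypothetical name `emails`: calling a plain string method through `map` hits the NA value and raises, while the `str` accessor passes NA through:

```python
import numpy as np
import pandas as pd

emails = pd.Series({'Dave': 'dave@google.com',
                    'Rob': 'rob@gmail.com',
                    'Wes': np.nan})

# map applies the function to every value, including the float NaN,
# which has no .upper() method
try:
    emails.map(lambda x: x.upper())
    failed = False
except AttributeError:
    failed = True

# The str accessor skips NA values and leaves them as NA
upper = emails.str.upper()

print(failed)
print(upper)
```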

Regex can be used too, along with any `re` options like IGNORECASE

In [93]:
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [94]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object

There are a couple of ways to do vectorized element retrieval: either use `str.get` or index into the `str` attribute

In [101]:
matches = data.str.match(pattern, flags=re.IGNORECASE)

In [102]:
matches

Dave     (dave, google, com)
Rob        (rob, gmail, com)
Steve    (steve, gmail, com)
Wes                      NaN
dtype: object

In [103]:
matches.str.get(0)

Dave      dave
Rob        rob
Steve    steve
Wes        NaN
dtype: object

In [106]:
matches.str[:5]

Dave     (dave, google, com)
Rob        (rob, gmail, com)
Steve    (steve, gmail, com)
Wes                      NaN
dtype: object
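The tuple output above comes from an older pandas release. In recent versions, to my knowledge, `str.match` returns a boolean Series instead, and the captured groups are retrieved with `str.extract`. A sketch under that assumption, on a shortened Series with the hypothetical name `emails`:

```python
import re

import numpy as np
import pandas as pd

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
emails = pd.Series({'Dave': 'dave@google.com',
                    'Rob': 'rob@gmail.com',
                    'Wes': np.nan})

# Whether each value matches the pattern at its start
is_email = emails.str.match(pattern, flags=re.IGNORECASE)

# One column per capture group; NA rows stay NA
parts = emails.str.extract(pattern, flags=re.IGNORECASE)

print(is_email)
print(parts)
```

The columns of `parts` are numbered 0, 1, 2 because the groups are unnamed.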