##### All of the materials were taken from the book "Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter" by Wes McKinney, Third Edition, August 2022, published by O'Reilly Media.

When doing data analysis and modeling, a significant amount of time
is spent on data preparation: loading, cleaning, transforming, and rearranging. Such
tasks are often reported to take up 80% or more of an analyst’s time. The
way that data is stored in files or databases is often not in the right format for a particular
task, so the data must be cleaned, transformed, and rearranged before analysis and modeling can begin. In this tutorial, we go through some examples of such data processing.

## 1. Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals
of pandas is to make working with missing data as painless as possible. For example,
all of the descriptive statistics on pandas objects exclude missing data by default. Keep in mind that excluding missing values can bias an analysis if the data are not missing at random.

The way that missing data is represented in pandas objects is somewhat imperfect,
but it is sufficient for most real-world use. For data with float64 dtype, pandas uses
the floating-point value NaN (Not a Number) to represent missing data.

Missing values also need to be handled when the data entries are not numerical.

In [None]:
import numpy as np
import pandas as pd
# PREVIOUS_MAX_ROWS = pd.options.display.max_rows
# pd.options.display.max_rows = 25
# pd.options.display.max_columns = 20
# pd.options.display.max_colwidth = 82
# np.random.seed(12345)
# import matplotlib.pyplot as plt
# plt.rc("figure", figsize=(10, 6))
# np.set_printoptions(precision=4, suppress=True)

pandas uses NaN to represent missing data:

In [3]:
float_data = pd.Series([1.2, -3.5, np.nan, 0])
float_data

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

The isna method gives us a Boolean Series with True where values are null:

In [4]:
float_data.isna()

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we’ve adopted a convention used in the R programming language by referring
to missing data as NA, which stands for not available. In statistics applications,
NA data may either be data that does not exist or that exists but was not observed
(through problems with data collection, for example). When cleaning up data for
analysis, it is often important to do analysis on the missing data itself to identify data
collection problems or potential biases in the data caused by missing data.

The built-in Python None value is also treated as NA:

In [5]:
string_data = pd.Series(["aardvark", np.nan, None, "avocado"])
string_data
string_data.isna()
float_data = pd.Series([1, 2, None], dtype='float64')
float_data
float_data.isna()

0    False
1    False
2     True
dtype: bool
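Since analyzing the missing data itself is often worthwhile, a natural first step is simply to count it. The following is a minimal sketch; the `survey` DataFrame and its column names are made up for illustration and are not part of the original material:

```python
import numpy as np
import pandas as pd

# Hypothetical data: both np.nan and None count as missing
survey = pd.DataFrame({
    "age": [34, np.nan, 29, np.nan],
    "city": ["Lyon", None, "Oslo", "Lima"],
})

# Number of missing entries in each column
missing_per_column = survey.isna().sum()

# Fraction of rows that contain at least one missing entry
share_incomplete = survey.isna().any(axis="columns").mean()

print(missing_per_column)
print(share_incomplete)
```

Counts like these can hint at whether missingness is concentrated in particular columns before you decide to drop or fill anything.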

#### Filtering out missing data

There are a few ways to filter out missing data. While you always have the option to
do it by hand using pandas.isna and Boolean indexing, dropna can be helpful. On a
Series, it returns the Series with only the nonnull data and index values:

In [11]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

This is the same thing as doing:

In [12]:
data[data.notna()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, there are different ways to remove missing data. You may
want to drop rows or columns that are all NA, or only those rows or columns
containing any NAs at all. dropna by default drops any row containing a missing
value:

In [13]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data
data.dropna()

     0    1    2
0  1.0  6.5  3.0


Passing how="all" will drop only rows that are all NA:

In [14]:
data.dropna(how="all")

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


Keep in mind that these functions return new objects by default and do not modify
the contents of the original object.

To drop columns in the same way, pass axis="columns":

In [15]:
data[4] = np.nan
data
data.dropna(axis="columns", how="all")

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


Suppose you want to keep only rows containing at least a certain number of non-missing
observations. You can indicate this with the thresh argument; thresh=2 keeps the rows that have at least two non-NA values:

In [61]:
df = pd.DataFrame(np.random.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df
df.dropna()
df.dropna(thresh=2)

          0         1         2
2 -0.593451       NaN  0.217147
3 -0.631937       NaN -0.077906
4  0.022370 -0.794904  0.036114
5  0.836976 -0.225510  1.752678
6 -1.141588 -0.498659 -0.386654
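Because the random data above makes the effect of thresh hard to verify by eye, here is a small deterministic sketch. The `frame` DataFrame is a made-up example with zero, one, two, and three missing values per row:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has no missing values, row 3 has only missing values
frame = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, np.nan, np.nan, np.nan],
})

# thresh=2 keeps rows that have at least two non-missing values
kept = frame.dropna(thresh=2)
print(kept.index.tolist())
```

Rows 0 and 1 survive because they hold three and two non-missing values, respectively.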


#### Filling in missing data

Rather than filtering out missing data (and potentially discarding other data along
with it), you may want to fill in the “holes” in any number of ways. For most
purposes, the fillna method is the workhorse function to use. Calling fillna with a
constant replaces missing values with that value:

In [62]:
df.fillna(0)

          0         1         2
0  1.492108  0.000000  0.000000
1  0.334588  0.000000  0.000000
2 -0.593451  0.000000  0.217147
3 -0.631937  0.000000 -0.077906
4  0.022370 -0.794904  0.036114
5  0.836976 -0.225510  1.752678
6 -1.141588 -0.498659 -0.386654


Calling fillna with a dictionary, you can use a different fill value for each column:

In [18]:
df.fillna({1: 0.5, 2: 0})

          0         1         2
0 -0.755580  0.500000  0.000000
1 -0.099767  0.500000  0.000000
2  0.650207  0.500000  0.504136
3  0.110072  0.500000  1.327772
4 -0.078091  0.662501 -0.057263
5 -0.441102 -1.274081  0.288202
6 -0.427119  0.708112  0.317421


The `ffill` method (short for `forward fill`) fills missing values in the DataFrame by propagating the last valid observation forward. The argument `limit=2` fills at most two consecutive missing values:

In [65]:
df = pd.DataFrame(np.random.standard_normal((6, 3)))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan
df
df.ffill()
df.ffill(limit=2)

          0         1         2
0  0.598438 -1.098419 -0.344582
1  0.536808  0.919844  1.348353
2  0.996751  0.919844 -0.865126
3  1.092530  0.919844 -1.661022
4 -1.832317       NaN -1.661022
5  0.996806       NaN -1.661022


With `fillna` you can do lots of things such as simple data imputation using the
median or mean statistics:

In [67]:
data = pd.Series([1., np.nan, 3.5, np.nan, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
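The same idea extends to DataFrames, where each column can be filled with its own mean. This is a minimal sketch; the `scores` DataFrame is invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
scores = pd.DataFrame({
    "math": [1.0, np.nan, 3.0],
    "art": [10.0, 20.0, np.nan],
})

# scores.mean() is a Series indexed by column name, so fillna matches
# each column with its own mean (computed over the non-missing values)
filled = scores.fillna(scores.mean())
print(filled)
```

Mean imputation is simple but shrinks the variance of the filled column, so it is best treated as a quick baseline.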

## 2. Data Transformation

So far we have been concerned with handling missing data. Filtering,
cleaning, and other transformations are another class of important operations.

#### Removing Duplicates

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an
example:

In [68]:
data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})
data

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4


The DataFrame method duplicated returns a Boolean Series indicating whether
each row is a duplicate (its column values are exactly equal to those in an earlier row)
or not:

In [22]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Relatedly, drop_duplicates returns a DataFrame containing only the rows for which the
duplicated array is False:

In [23]:
data.drop_duplicates()

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4


Both methods by default consider all of the columns; alternatively, you can specify
any subset of them to detect duplicates. Suppose we had an additional column of
values and wanted to filter duplicates based only on the "k1" column:

In [69]:
data["v1"] = range(7)
data
data.drop_duplicates(subset=["k1"])

    k1  k2  v1
0  one   1   0
1  two   1   1


duplicated and drop_duplicates by default keep the first observed value combination.
Passing keep="last" will return the last one:

In [25]:
data.drop_duplicates(["k1", "k2"], keep="last")

    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
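Besides "first" and "last", the keep argument also accepts False, which flags every member of a duplicated group. That can be handy when you want to inspect the duplicates before deciding which copy to retain. A small sketch with made-up `records`:

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical
records = pd.DataFrame({"k1": ["one", "two", "one", "two"],
                        "k2": [1, 1, 1, 2]})

# keep=False marks every member of a duplicated group, not just the repeats
all_copies = records[records.duplicated(keep=False)]
print(all_copies)
```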


#### Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform some transformation based on the
values in an array, Series, or column in a DataFrame. Consider the following hypothetical
data collected about various kinds of meat:

In [26]:
data = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon",
                              "pastrami", "corned beef", "bacon",
                              "pastrami", "honey ham", "nova lox"],
                     "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     pastrami     6.0
4  corned beef     7.5
5        bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0


Suppose you wanted to add a column indicating the type of animal that each food
came from. Let’s write down a mapping of each distinct meat type to the kind of
animal:

In [27]:
meat_to_animal = {
  "bacon": "pig",
  "pulled pork": "pig",
  "pastrami": "cow",
  "corned beef": "cow",
  "honey ham": "pig",
  "nova lox": "salmon"
}

The map method on a Series accepts a function or dictionary-like object containing a mapping to do
the transformation of values:

In [28]:
data["animal"] = data["food"].map(meat_to_animal)
data

          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     pastrami     6.0     cow
4  corned beef     7.5     cow
5        bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon


We could also have passed a function that does all the work:

In [29]:
def get_animal(x):
    return meat_to_animal[x]
data["food"].map(get_animal)

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Using map is a convenient way to perform element-wise transformations and other
data cleaning-related operations.
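The dictionary and function forms of map behave differently when a value has no entry in the mapping: a dictionary yields NaN, while a function such as get_animal above would raise a KeyError. The sketch below illustrates this with a deliberately incomplete mapping; `partial_mapping` and the "unknown" default are assumptions made for the example:

```python
import pandas as pd

# Hypothetical mapping that has no entry for "nova lox"
partial_mapping = {"bacon": "pig", "pastrami": "cow"}
foods = pd.Series(["bacon", "pastrami", "nova lox"])

# With a dictionary, unmapped values become NaN
by_dict = foods.map(partial_mapping)

# A function must handle unknown keys itself; dict.get supplies a default
by_function = foods.map(lambda x: partial_mapping.get(x, "unknown"))

print(by_dict)
print(by_function)
```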

#### Replacing values
Filling in missing data with the fillna method is a special case of more general value
replacement. As you’ve already seen, map can be used to modify a subset of values
in an object, but replace provides a simpler and more flexible way to do so. Let’s
consider this Series:

In [70]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The -999 values might be sentinel values for missing data. To replace these with NA
values that pandas understands, we can use replace, producing a new Series:

In [31]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, you instead pass a list and then the
substitute value:

In [32]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes:

In [71]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dictionary:

In [34]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
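replace is also available on DataFrames. The sketch below, using a made-up `readings` DataFrame, first replaces a sentinel everywhere and then restricts the replacement to a single column with a nested dictionary:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor data where -999 marks a missing reading
readings = pd.DataFrame({"temp": [21.5, -999.0, 19.0],
                         "depth": [-999.0, 3.0, -1.0]})

# Replace the sentinel everywhere in the DataFrame
cleaned = readings.replace(-999, np.nan)

# Restrict the replacement to one column with a nested dictionary
# of the form {column: {old_value: new_value}}
temp_only = readings.replace({"temp": {-999: np.nan}})

print(cleaned)
print(temp_only)
```

The column-specific form is useful when the same number is a sentinel in one column but a legitimate value in another.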

#### Renaming Axis Indexes
Like values in a Series, axis labels can be similarly transformed by a function or
mapping of some form to produce new, differently labeled objects. You can also
modify the axes in place without creating a new data structure. Here’s a simple
example:

In [74]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["Ohio", "Colorado", "New York"],
                    columns=["one", "two", "three", "four"])

data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11


Like a Series, the axis indexes have a map method:

In [75]:
def transform(x):
    return x[:4].upper()

data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

You can assign to the index attribute, modifying the DataFrame in place:

In [76]:
data.index = data.index.map(transform)
data

      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11


If you want to create a transformed version of a dataset without modifying the
original, a useful method is rename:

In [77]:
data.rename(index=str.title, columns=str.upper)

      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11


The original DataFrame is left unchanged:

In [79]:
data

      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11


Notably, rename can be used in conjunction with a dictionary-like object, providing
new values for a subset of the axis labels:

In [78]:
data.rename(index={"OHIO": "INDIANA"},
            columns={"three": "peekaboo"})

         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11


rename saves you from the chore of copying the DataFrame manually and assigning
new values to its index and columns attributes.

#### Discretization and Binning
Continuous data is often discretized or otherwise separated into “bins” for analysis.
Suppose you have data about a group of people in a study, and you want to group
them into discrete age buckets:

In [80]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To
do so, you can use pandas.cut:

In [41]:
bins = [18, 25, 35, 60, 100]
age_categories = pd.cut(ages, bins)
age_categories

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special Categorical object. The output you see
describes the bins computed by pandas.cut. Each bin is identified by a special
(unique to pandas) interval value type containing the lower and upper limit of each
bin:

In [42]:
age_categories.codes
age_categories.categories
age_categories.categories[0]
pd.Series(age_categories).value_counts()

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
Name: count, dtype: int64

Note that these value counts are the bin counts for the result of
pandas.cut.

In the string representation of an interval, a parenthesis means that the side is open
(exclusive), while the square bracket means it is closed (inclusive). You can change
which side is closed by passing right=False:

In [43]:
pd.cut(ages, bins, right=False)

[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64, left]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]

You can override the default interval-based bin labeling by passing a list or array to
the labels option:

In [44]:
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']
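Which side of an interval is closed matters most for values that sit exactly on a bin edge or outside all bins. The following sketch reuses the bin edges from above with a few made-up edge-case ages; it also uses the include_lowest option of pandas.cut, which is not covered in the text:

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
edge_cases = [17, 18, 25, 26, 101]

# Default right=True: intervals are (18, 25], (25, 35], ...
# 17, 18, and 101 fall outside every interval and become missing
right_closed = pd.cut(edge_cases, bins)

# include_lowest=True additionally closes the first interval on the left,
# so 18 is now assigned to the first bin
with_lowest = pd.cut(edge_cases, bins, include_lowest=True)

# In the codes array, -1 marks a missing category
print(right_closed.codes)
print(with_lowest.codes)
```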

If you pass an integer number of bins to pandas.cut instead of explicit bin edges, it
will compute equal-length bins based on the minimum and maximum values in the
data. Consider the case of some uniformly distributed data chopped into fourths:

In [45]:
data = np.random.uniform(size=20)
pd.cut(data, 4, precision=2)

[(0.52, 0.76], (0.52, 0.76], (0.76, 1.0], (0.044, 0.28], (0.28, 0.52], ..., (0.76, 1.0], (0.044, 0.28], (0.52, 0.76], (0.76, 1.0], (0.044, 0.28]]
Length: 20
Categories (4, interval[float64, right]): [(0.044, 0.28] < (0.28, 0.52] < (0.52, 0.76] < (0.76, 1.0]]

The precision=2 option limits the decimal precision to two digits.

A closely related function, pandas.qcut, bins the data based on sample quantiles.
Depending on the distribution of the data, using pandas.cut will not usually result in each bin having the same number of data points. Since pandas.qcut uses sample
quantiles instead, you will obtain roughly equally sized bins:

In [46]:
data = np.random.standard_normal(1000)
quartiles = pd.qcut(data, 4, precision=2)
quartiles
pd.Series(quartiles).value_counts()

(-3.4, -0.78]      250
(-0.78, -0.026]    250
(-0.026, 0.67]     250
(0.67, 3.32]       250
Name: count, dtype: int64

Similar to pandas.cut, you can pass your own quantiles (numbers between 0 and 1,
inclusive):

In [47]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.]).value_counts()

(-3.395, -1.424]    100
(-1.424, -0.026]    400
(-0.026, 1.261]     400
(1.261, 3.316]      100
Name: count, dtype: int64
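The difference between equal-width and quantile-based binning is easiest to see on skewed data. The sketch below uses a small made-up list with one large value, so that the bin counts can be checked by hand:

```python
import pandas as pd

# Hypothetical skewed data: one value is far larger than the rest
skewed = [1, 2, 3, 4, 5, 6, 7, 100]

# Equal-width bins: the single large value stretches the range,
# so almost everything lands in the first bin
equal_width = pd.Series(pd.cut(skewed, 4)).value_counts().sort_index()

# Quantile-based bins: each bin receives the same number of points
equal_count = pd.Series(pd.qcut(skewed, 4)).value_counts().sort_index()

print(equal_width.tolist())
print(equal_count.tolist())
```

Here sort_index orders the counts by bin rather than by frequency, which makes the two results directly comparable.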

#### Detecting and filtering outliers

Filtering or transforming outliers is largely a matter of applying array operations.
Consider a DataFrame with some normally distributed data:

In [81]:
data = pd.DataFrame(np.random.standard_normal((1000, 4)))
data.describe()

                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.022552    -0.003630     0.005575     0.004298
std       1.033656     1.014454     0.995348     0.956957
min      -3.227501    -3.162287    -3.751185    -2.822679
25%      -0.655004    -0.696647    -0.670643    -0.637073
50%       0.028172     0.012955     0.023912    -0.009255
75%       0.684961     0.667048     0.674933     0.631351
max       2.974329     3.088832     2.946164     3.326923


Suppose you wanted to find values in one of the columns exceeding 3 in absolute
value:

In [82]:
col = data[2]
col[col.abs() > 3]

229   -3.751185
Name: 2, dtype: float64

To select all rows having a value exceeding 3 in absolute value, you can use the any method on a
Boolean DataFrame:

In [50]:
data[(data.abs() > 3).any(axis="columns")]

            0         1         2         3
41  -1.058946  0.108604  1.010567  3.119444
303  3.006409  0.014569 -0.801743 -1.330348
309  1.272735 -0.558303  3.022455 -0.827212
440 -0.388091  0.455440  3.251562 -0.346545
446 -0.078396 -3.050441  0.219601  0.314125
570  0.389532  1.141602  3.232466 -0.173980
571  0.135324  3.604039  0.148931  0.744126
600 -3.134964 -1.034478  1.149638 -0.629533
601  1.295051 -1.793965  3.209390 -0.519859
632  4.097412  1.933595 -1.257332  1.121858


The parentheses around data.abs() > 3 are necessary in order to call the any
method on the result of the comparison operation.

Values can be set based on these criteria. Here is code to cap values outside the
interval –3 to 3:

In [51]:
data[data.abs() > 3] = np.sign(data) * 3
data.describe()

                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.022308    -0.034723     0.071749     0.022664
std       1.036538     0.990546     1.009305     0.937835
min      -3.000000    -3.000000    -2.880561    -2.689477
25%      -0.719153    -0.702378    -0.629438    -0.613895
50%      -0.026570    -0.066718     0.074960     0.023582
75%       0.654451     0.573130     0.754719     0.636417
max       3.000000     3.000000     3.000000     3.000000


The statement np.sign(data) produces 1 and –1 values based on whether the values
in data are positive or negative:

In [52]:
np.sign(data).head()

     0    1    2    3
0  1.0  1.0  1.0  1.0
1 -1.0  1.0  1.0 -1.0
2  1.0  1.0 -1.0 -1.0
3 -1.0 -1.0  1.0  1.0
4 -1.0  1.0  1.0  1.0
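As an alternative that is not used in the text, the DataFrame clip method caps values at given bounds in a single call. The sketch below compares it with the np.sign approach on a small made-up DataFrame, so the result can be checked by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical values, some of which lie outside the interval -3 to 3
values = pd.DataFrame({"x": [-4.5, 0.2, 3.7], "y": [1.0, -3.1, 2.9]})

# Approach from the text: overwrite outliers with +/-3 according to sign
capped = values.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3

# Alternative: clip bounds the values directly and returns a new object
clipped = values.clip(lower=-3, upper=3)

print(capped.equals(clipped))
```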


## 3. String manipulation

Python has long been a popular raw data manipulation language in part due to its
ease of use for string and text processing. Most text operations are made simple
with the string object’s built-in methods. For more complex pattern matching and
text manipulations, regular expressions may be needed. pandas adds to the mix by
enabling you to apply string and regular expressions concisely on whole arrays of
data, additionally handling the annoyance of missing data.

#### Python Built-In String Object Methods

In many string munging and scripting applications, built-in string methods are
sufficient. As an example, a comma-separated string can be broken into pieces with
split:

In [83]:
val = "a,b, guido"
val.split(",")

['a', 'b', ' guido']

split is often combined with strip to trim whitespace (including line breaks):

In [84]:
pieces = [x.strip() for x in val.split(",")]
pieces

['a', 'b', 'guido']

These substrings could be concatenated together with a two-colon delimiter using
addition:

In [85]:
first, second, third = pieces
first + "::" + second + "::" + third

'a::b::guido'

But this isn’t a practical generic method. A faster and more Pythonic way is to pass a
list or tuple to the join method on the string "::":

In [86]:
"::".join(pieces)

'a::b::guido'

Other methods are concerned with locating substrings. Using Python’s in keyword is
the best way to detect a substring, though index and find can also be used:

In [87]:
"guido" in val
val.index(",")
val.find(":")

-1

Note that the difference between find and index is that index raises an exception if
the string isn’t found (versus returning –1):

In [88]:
val.index(":")

ValueError: substring not found

Relatedly, count returns the number of occurrences of a particular substring:

In [89]:
val.count(",")

2

replace will substitute occurrences of one pattern for another. It is commonly used
to delete patterns, too, by passing an empty string:

In [90]:
val.replace(",", "::")
val.replace(",", "")

'ab guido'

### Python built-in string methods

count: Return the number of nonoverlapping occurrences of substring in the string

endswith: Return True if string ends with suffix

startswith: Return True if string starts with prefix

join: Use string as delimiter for concatenating a sequence of other strings

index: Return starting index of the first occurrence of passed substring if found in the string; otherwise, raises ValueError if not found

find: Return position of first character of first occurrence of substring in the string; like index, but returns –1 if not found

rfind: Return position of first character of last occurrence of substring in the string; returns –1 if not found

replace: Replace occurrences of string with another string

strip, rstrip, lstrip: Trim whitespace, including newlines, on both sides, on the right side, or on the left side, respectively

split: Break string into list of substrings using passed delimiter

lower: Convert alphabet characters to lowercase

upper: Convert alphabet characters to uppercase

casefold: Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form

ljust, rjust: Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width
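A few of the methods listed above that were not demonstrated earlier can be tried on the same `val` string. This is only a short sketch; the "Straße" comparison is an illustrative example that does not come from the text:

```python
val = "a,b, guido"

starts = val.startswith("a")       # True
ends = val.endswith("guido")       # True
last_comma = val.rfind(",")        # 3, the position of the last comma
shouted = val.upper()              # "A,B, GUIDO"
padded = "guido".ljust(8, ".")     # "guido...", padded on the right to width 8

# casefold is more aggressive than lower: the German "ß" becomes "ss"
same = "Straße".casefold() == "STRASSE".casefold()

print(starts, ends, last_comma, shouted, padded, same)
```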

#### Regular expressions
Regular expressions provide a flexible way to search or match (often more complex)
string patterns in text. A single expression, commonly called a regex, is a string
formed according to the regular expression language. Python’s built-in re module is
responsible for applying regular expressions to strings; I’ll give a number of examples
of its use here.

The `re` module functions fall into three categories: pattern matching, substitution,
and splitting. Naturally these are all related; a regex describes a pattern to locate in
the text, which can then be used for many purposes. Let’s look at a simple example:
suppose we wanted to split a string with a variable number of whitespace characters
(tabs, spaces, and newlines).

The regex describing one or more whitespace characters is \s+:

In [91]:
import re
text = "foo    bar\t baz  \tqux"
re.split(r"\s+", text)

['foo', 'bar', 'baz', 'qux']

When you call re.split(r"\s+", text), the regular expression is first compiled, and
then its split method is called on the passed text. You can compile the regex yourself
with re.compile, forming a reusable regex object:

In [92]:
regex = re.compile(r"\s+")
regex.split(text)

['foo', 'bar', 'baz', 'qux']

If, instead, you wanted to get a list of all patterns matching the regex, you can use the
findall method:

In [None]:
regex.findall(text)

Creating a regex object with re.compile is highly recommended if you intend to
apply the same expression to many strings; doing so will save CPU cycles.

`match` and `search` are closely related to findall. While findall returns all matches
in a string, search returns only the first match. More rigidly, match only matches at
the beginning of the string. As a less trivial example, let’s consider a block of text and
a regular expression capable of identifying most email addresses:

In [93]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com"""
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

# re.IGNORECASE makes the regex case insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

Using findall on the text produces a list of the email addresses:

In [94]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

`search` returns a special match object for the first email address in the text. For the
preceding regex, the match object can only tell us the start and end position of the
pattern in the string:

In [95]:
m = regex.search(text)
m
text[m.start():m.end()]

'dave@google.com'

`regex.match` returns None, as it will match only if the pattern occurs at the start of the
string:

In [96]:
print(regex.match(text))

None


Relatedly, sub will return a new string with occurrences of the pattern replaced by a
new string:

In [97]:
print(regex.sub("REDACTED", text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED


Suppose you wanted to find email addresses and simultaneously segment each
address into its three components: username, domain name, and domain suffix. To
do this, put parentheses around the parts of the pattern to segment:

In [98]:
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)

A match object produced by this modified regex returns a tuple of the pattern
components with its groups method:

In [None]:
m = regex.match("wesm@bright.net")
m.groups()

`findall` returns a list of tuples when the pattern has groups:

In [99]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

sub also has access to groups in each match using special symbols like \1 and \2. The
symbol \1 corresponds to the first matched group, \2 corresponds to the second, and
so forth:

In [100]:
print(regex.sub(r"Username: \1, Domain: \2, Suffix: \3", text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
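Groups can also be given names, which makes the pieces easier to work with than numbered positions. The sketch below adds names to the three groups of the email pattern; the names `username`, `domain`, and `suffix` are chosen for illustration:

```python
import re

# The email pattern with each group named via (?P<name>...)
named_pattern = (r"(?P<username>[A-Z0-9._%+-]+)@"
                 r"(?P<domain>[A-Z0-9.-]+)\."
                 r"(?P<suffix>[A-Z]{2,4})")
named_regex = re.compile(named_pattern, flags=re.IGNORECASE)

# groupdict returns the matched groups as a dictionary keyed by name
m = named_regex.match("wesm@bright.net")
parts = m.groupdict()
print(parts)

# Named groups can be referenced in substitutions with \g<name>
rewritten = named_regex.sub(r"\g<username> at \g<domain>", "wesm@bright.net")
print(rewritten)
```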


#### String functions in Pandas

Cleaning up a messy dataset for analysis often requires a lot of string manipulation.
To complicate matters, a column containing strings will sometimes have missing data:

In [101]:
data = {"Dave": "dave@google.com", "Steve": "steve@gmail.com",
        "Rob": "rob@gmail.com", "Wes": np.nan}
data = pd.Series(data)
data
data.isna()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

String and regular expression methods can be applied (passing a lambda or other
function) to each value using data.map, but it will fail on the NA (null) values.
To cope with this, Series has array-oriented methods for string operations that skip
over and propagate NA values. These are accessed through Series’s str attribute;
for example, we could check whether each email address has "gmail" in it with
str.contains:

In [102]:
data.str.contains("gmail")

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

Note that the result of this operation has an object dtype. pandas has extension types
that provide for specialized treatment of strings, integers, and Boolean data which
until recently have had some rough edges when working with missing data:

In [103]:
data_as_string_ext = data.astype('string')
data_as_string_ext
data_as_string_ext.str.contains("gmail")

Dave     False
Steve     True
Rob       True
Wes       <NA>
dtype: boolean
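A mask that contains missing values generally cannot be used directly for Boolean indexing. One way around this is the na argument of str.contains, which sets the result for missing entries. The sketch below rebuilds the email Series under the hypothetical name `emails`:

```python
import numpy as np
import pandas as pd

emails = pd.Series({"Dave": "dave@google.com", "Steve": "steve@gmail.com",
                    "Rob": "rob@gmail.com", "Wes": np.nan})

# na=False treats missing entries as non-matches, giving a plain Boolean mask
is_gmail = emails.str.contains("gmail", na=False)

# The mask can now be used to filter the Series
gmail_users = emails[is_gmail]
print(gmail_users.index.tolist())
```

Whether a missing address should count as a non-match is a judgment call that depends on the analysis.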

Regular expressions can be used, too, along with any `re` options like IGNORECASE:

In [104]:
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

There are a couple of ways to do vectorized element retrieval. Either use str.get or
index into the str attribute:

In [105]:
matches = data.str.findall(pattern, flags=re.IGNORECASE).str[0]
matches
matches.str.get(1)

Dave     google
Steve     gmail
Rob       gmail
Wes         NaN
dtype: object

You can similarly slice strings using this syntax:

In [106]:
data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object

The `str.extract` method will return the captured groups of a regular expression as a
DataFrame:

In [108]:
data.str.extract(pattern, flags=re.IGNORECASE)

           0       1    2
Dave    dave  google  com
Steve  steve   gmail  com
Rob      rob   gmail  com
Wes      NaN     NaN  NaN
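If the groups in the pattern are named, str.extract uses those names as column labels instead of 0, 1, and 2. This is a minimal sketch with a shortened, hypothetical `emails` Series; the group names are chosen for illustration:

```python
import re

import numpy as np
import pandas as pd

emails = pd.Series({"Dave": "dave@google.com", "Wes": np.nan})

# The email pattern with named groups; the names become column labels
named_pattern = (r"(?P<username>[A-Z0-9._%+-]+)@"
                 r"(?P<domain>[A-Z0-9.-]+)\."
                 r"(?P<suffix>[A-Z]{2,4})")

parts = emails.str.extract(named_pattern, flags=re.IGNORECASE)
print(parts)
```

Rows with missing input, such as "Wes" here, come back with missing values in every extracted column.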
